Submitted by MysteryInc152 t3_11ckc8a in singularity
Comments
RabidHexley t1_jaa2ai7 wrote
>The danger is the science turning into a black box as dense as the LLMs themselves.
Hopefully we can just ask the LLM for a rundown.
MysteryInc152 OP t1_ja3hn8q wrote
>Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
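The control-tag conditioning the abstract describes can be sketched in miniature. This is hypothetical scaffolding, not ProGen's actual API: a seeded random sampler stands in for the trained network, and `AMINO_ACIDS`, `generate_sequence`, and the "lysozyme" tag are illustrative names.

```python
import random

# The 20 standard amino acids, one letter each
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_sequence(control_tag, length, seed=0):
    """Toy sketch of tag-conditioned autoregressive generation.

    A real model like ProGen predicts each next residue from learned
    weights conditioned on the control tag's tokens; here a
    deterministically seeded random choice stands in for the network,
    so only the tag -> sequence flow is visible.
    """
    rng = random.Random(f"{control_tag}:{seed}")  # tag conditions the "model"
    sequence = []
    for _ in range(length):
        # In a real model: p = softmax(f(tag_tokens + sequence_so_far))
        sequence.append(rng.choice(AMINO_ACIDS))
    return "".join(sequence)

print(generate_sequence("lysozyme", length=30))
```

Because the sampler is seeded from the tag, the same tag reproduces the same sequence and different tags yield different sequences, loosely mirroring how a control tag steers generation toward a protein family.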
vhu9644 t1_ja6wu9v wrote
I know this is exciting (and it is) but just to temper the excitement: many computationally designed proteins have issues.
Most aren’t that good at working under in vivo conditions
We still can’t really adjust the parameters we actually want (like the temperature range these proteins work in)
Most are stuck on “simpler” problems like binding rather than enzymatic function
There may also be issues with evolvability of these enzymes
But all the same, it’s not an unnatural situation either. Protein sequences are still sequences: amino acids are added one by one to build them up, and we’ve long known that neural nets are good at sequence problems. Before we solved tertiary structure prediction, the state of the art in secondary structure prediction was also neural networks. It’s just that tertiary structure and these kinds of generative models are hard.
We’re finally cracking into generative protein design and the field is super exciting now, but it’s still only really preliminary results we’re seeing.
Facts_About_Cats t1_ja5q7ce wrote
What does the structure of language have to do with the folding shapes of proteins?
MysteryInc152 OP t1_ja5rsxd wrote
It shouldn't, as you understand it, and that's why this is pretty huge. Whatever LLMs are learning during training is proving, more and more, to be the real deal.
throwaway_890i t1_jab66dd wrote
Isn't this just the same kind of neural network that has been solving this kind of problem long before LLMs?
MysteryInc152 OP t1_jab6uae wrote
Definitely not, no. This is the first time a language model has been used to tackle this.
dwarfarchist9001 t1_ja6cfn4 wrote
This paper actually skips the folding step entirely. The AI was trained on a list of protein amino acid sequences labeled with their purpose. Then they had it predict new amino acid sequences to fulfill the same purposes. Finally, they actually made the proteins the model suggested, and the proteins worked with quite high levels of efficiency.
The most interesting part to me is that some of the proteins suggested by the model worked despite having little similarity to the proteins in the training data, with sequence identity as low as 31.4% in one case. This suggests to me that the model has caught on to some thus far unknown rules underlying the relationship between the sequences and functions of proteins.
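For context, the 31.4% figure is pairwise sequence identity: the fraction of matching residues between aligned sequences. A minimal sketch, assuming the sequences are already aligned to equal length (real pipelines align them first, e.g. with BLAST or a pairwise aligner, and divide matches by alignment length):

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of identical residues between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

print(sequence_identity("MKT", "MRT"))  # 2 of 3 residues match
```

By this measure, a generated protein sharing only 31.4% identity with anything in the training set is far from a copy of a known sequence, which is what makes the result notable.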
blueSGL t1_ja6pgm2 wrote
Listening to Neel Nanda talk about how models form structures to solve common problems presented in training, it's no wonder they are able to pick up on patterns better than humans; that's what they are designed for.
And I believe that training models with no intention of running them, purely to see what, if any, hidden underlying structures humanity has collectively missed, is called something like 'microscope AI'.
RabidHexley t1_jaa3go2 wrote
> purely to see what if any hidden underlying structures humanity has collectively missed
This is one of the things I feel has real potential even for "narrow" AI as far as expanding human knowledge. Something may very well be within the scope of known human science without humans ever realizing it. If you represented all human knowledge as a sphere it'd probably have a composition as porous as a sponge.
AI doesn't necessarily need to be able to reason "beyond" current human understanding to expand upon known science, but simply make connections we're unable to see.
Facts_About_Cats t1_ja8q9at wrote
There is no reason why the physical structure of proteins should in any way resemble or be related to the structure and grammar of the associations and relationships between words.
Jcat49er t1_ja96hy5 wrote
That’s the point though. According to the results of this and other papers, there is a still-unknown relationship between proteins that AIs are able to recognize and manipulate. It just happens that the way AIs find patterns in human language can also be used to find the structure of proteins.
diabeetis t1_jac6a4k wrote
I don't see why it shouldn't. It abstracts meaning from the relationships in the data, whether it's language or sequences
turnip_burrito t1_ja6mrbg wrote
Spooky model magic.
hackinthebochs t1_ja6uapk wrote
Any structured data is a language in a broad sense. Tokens identify structural units and the grammar determine how these structural units interrelate. But the grammar can be arbitrarily complex and so can encode deep relationships among data in any domain. This is why "language models" are so powerful in a vast array of contexts.
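The point generalizes to any token stream. As an illustrative sketch (toy names, not from the paper), the simplest possible "grammar" statistic, adjacent-token counts, treats word tokens and amino-acid residues with exactly the same machinery:

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent-token pairs in any sequence of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Same code, two domains:
print(bigram_counts("the cat sat".split()))  # word tokens
print(bigram_counts("MKTAYIAKQR"))           # amino-acid tokens
```

Real language models learn far richer, longer-range structure than bigrams, but the domain-independence is the same: anything tokenizable with statistical regularities is a "language" to the model.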
eve_of_distraction t1_ja9c6cn wrote
One step closer to curing the dreaded prion diseases. One day. 🙏
Surur t1_ja3lylr wrote
What is really interesting about this is that the LLM may have a better understanding of what makes an enzyme function than the human scientists.
The danger is the science turning into a black box as dense as the LLMs themselves.