Paper: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v2

Meta's Tweet: https://twitter.com/MetaAI/status/1587467591068459008

Abstract

>Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing billions of protein sequences revealed by large scale gene sequencing experiments would necessitate a breakthrough in the speed of folding. Here we show that direct inference of structure from primary sequence using a large language model enables an order of magnitude speed-up in high resolution structure prediction. Leveraging the insight that language models learn evolutionary patterns across millions of sequences, we train models up to 15B parameters, the largest language model of proteins to date. As the language models are scaled they learn information that enables prediction of the three-dimensional structure of a protein at the resolution of individual atoms. This results in prediction that is up to 60x faster than state-of-the-art while maintaining resolution and accuracy. Building on this, we present the ESM Metagenomic Atlas. This is the first large-scale structural characterization of metagenomic proteins, with more than 617 million structures. The atlas reveals more than 225 million high confidence predictions, including millions whose structures are novel in comparison with experimentally determined structures, giving an unprecedented view into the vast breadth and diversity of the structures of some of the least understood proteins on earth.

Comments

timy2shoes t1_iunym8r wrote on November 1, 2022 at 7:32 PM

We've been testing out their embeddings for transfer learning tasks and they've been performing quite well. Better than previous embeddings that we have tested. The 15B parameter model though is a pain in the ass. Getting the embeddings requires a workaround that is difficult to implement. Probably not worth it in my opinion.

nivrams_brain t1_iuo3l24 wrote on November 1, 2022 at 8:03 PM

What kind of downstream tasks are you looking at?

timy2shoes t1_iuo4xa4 wrote on November 1, 2022 at 8:12 PM

ML-guided protein engineering.

nivrams_brain t1_iuo8d1h wrote on November 1, 2022 at 8:33 PM

Sounds cool, are you in academia or industry?

timy2shoes t1_iuoafb9 wrote on November 1, 2022 at 8:46 PM

Industry

MangoGuyyy t1_iuq81od wrote on November 2, 2022 at 5:48 AM

What company? I’m curious.

ROFLLOLSTER t1_iupsskm wrote on November 2, 2022 at 3:11 AM

> requires a workaround that is difficult to implement

What workaround? I've also been working with ESM and tried the 15B parameter variant. It seemed worse than the 3B in my tests, but maybe I just missed the problem?

timy2shoes t1_iuptv7y wrote on November 2, 2022 at 3:20 AM

We had to use a workaround to fit the 15B parameter model on a p3.8xlarge instance.

> I've also been working with ESM and tried the 15B parameter variant.

Huh. We’ve noticed the same thing. Interesting that others are having the same problem.
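
For context, here is a rough back-of-envelope sketch of why a 15B parameter model is a tight fit. It counts weight memory only (no activations or buffers). The instance specs (4 GPUs with 16 GB each for a p3.8xlarge) and the helper names are assumptions for illustration, not details given in this thread, and the actual workaround is not described here.

```python
import math

# Assumptions (not from the thread): exactly 15e9 parameters,
# a p3.8xlarge with 4 GPUs of 16 GB each, and 1 GB = 1e9 bytes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}
GPU_MEMORY_GB = 16
NUM_GPUS = 4


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed just to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(num_params: float, dtype: str, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, dtype) / gpu_memory_gb)


if __name__ == "__main__":
    for dtype in ("fp32", "fp16"):
        need = weight_memory_gb(15e9, dtype)
        gpus = min_gpus_for_weights(15e9, dtype, GPU_MEMORY_GB)
        print(f"{dtype}: {need:.0f} GB of weights, at least {gpus} of {NUM_GPUS} GPUs")
```

Under these assumptions the weights never fit on a single GPU, even in half precision, so they would have to be split across devices or offloaded in some way.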

Mister_Abc t1_iur4gme wrote on November 2, 2022 at 12:42 PM

First author here. We've had some indication that the 15B model may be overfit. It seemed to slightly improve on a few important metrics (CASP14), which is why we included it.

farmingvillein t1_iuowrov wrote on November 1, 2022 at 11:17 PM

I'm not sad that they are doing this, in the sense that it is almost certainly net-good for humanity, but it is bizarre to me that MetaAI is investing here.

OnceReturned t1_iupkp7o wrote on November 2, 2022 at 2:08 AM

This is all working towards engineering proteins from scratch to do whatever you want. The potential impact of engineered proteins over the next hundred years is on the order of the impact of computers over the past hundred years. Meta and Alphabet and some others get this. The problem has two basic challenges:

Pick a biochemical function you want.

1. What structure provides that function?
2. What amino acid sequence yields that structure?

We're getting closer to figuring out the second thing with these structure prediction models. Once you can reliably answer those two questions, the world is your oyster. Want to catalyze hundreds of the most valuable reactions used in industrial chemical production, thereby lowering cost, increasing efficiency, increasing yield, and even opening entirely new avenues of chemical engineering? You can. Want to develop new classes of drugs to effectively treat hundreds of the highest priority diseases? You can. Want cheap sensors that can detect anything? Want to engineer perfect crops? Want to turn waste into fuel? Want to cheaply and easily construct and repair polymers? Want to make complex metamaterials? Want real, sophisticated nanotechnology? The list goes on, well into the unimaginable. And, once you can answer the two questions, it's super cheap to make arbitrary amino acid sequences.

Figuring it out would be like discovering fire for the first time. It's especially interesting because it will almost certainly happen and be virtually perfected within the next couple decades (at the latest, IMO).

farmingvillein t1_iupoett wrote on November 2, 2022 at 2:36 AM

To be super clear, I'm not questioning the overall utility! It's strictly a statement that I can't square this with Meta's mission statement.

OnceReturned t1_iupwvi9 wrote on November 2, 2022 at 3:47 AM

That's fair.

If I were someone with billions of dollars to burn on whatever moonshot R&D I could think of, it would, at least in large part, be on this stuff. So, I'm more inclined to wonder why everybody isn't working on it.

le4mu t1_iuqqb4g wrote on November 2, 2022 at 10:12 AM

How is the progress on the first question? The first question seems like a fairy tale IMHO, but maybe that's because I am not in this domain. Could you provide more insight?

ynonym00s t1_iusru1v wrote on November 2, 2022 at 7:22 PM

@OnceReturned: These are naturally occurring proteins, no? For 2. to be solved, wouldn't we need to be able to predict structures for artificial sequences too? Moreover, don't we still need to predict structures in vivo (inside the organism/environment where they are used)?

Singularian2501 t1_iuocjc1 wrote on November 1, 2022 at 8:59 PM

Github: https://github.com/huggingface/transformers/releases/tag/v4.24.0

Twitter: https://twitter.com/MetaAI/status/1587467591068459008

seraschka t1_iusvmp1 wrote on November 2, 2022 at 7:47 PM

This is super awesome stuff! But I would put a little asterisk on this for now. To get an idea of its real, unbiased accuracy, I wonder if they participated in CASP15 which is essentially the gold standard for assessing structure predictions. I think results will be released in December ... I guess we will know more about this next month.

Lone-Pine t1_iusirtw wrote on November 2, 2022 at 6:25 PM

How is this different from AlphaFold?

gwyddonydd t1_iut86s5 wrote on November 2, 2022 at 9:07 PM

Quicker to run than AlphaFold but produces significantly less accurate models on average. For the very easiest cases they are probably roughly on par, though. To be honest, the speedup isn't really worth the loss in accuracy, especially when we already have a database of 230 million or so AlphaFold models to refer to.
