OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization.


Journal

Nature Methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604

Publication information

Publication date:
14 May 2024
History:
received: 14 August 2023
accepted: 3 April 2024
medline: 15 May 2024
pubmed: 15 May 2024
entrez: 14 May 2024
Status: aheadofprint

Abstract

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set are deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

Identifiers

pubmed: 38744917
doi: 10.1038/s41592-024-02272-z
pii: 10.1038/s41592-024-02272-z

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: U.S. Department of Health & Human Services | NIH | National Institute of General Medical Sciences (NIGMS)
ID: R35GM150546
Agency: U.S. Department of Health & Human Services | NIH | National Cancer Institute (NCI)
ID: U54-CA225088
Agency: National Science Foundation (NSF)
ID: OAC-2106661
Agency: National Science Foundation (NSF)
ID: OAC-2112606

Copyright information

© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.

References

Anfinsen, C. B. Principles that govern the folding of protein chains. Science 181, 223–230 (1973).
doi: 10.1126/science.181.4096.223 pubmed: 4124164
Dill, K. A., Ozkan, S. B., Shell, M. S. & Weikl, T. R. The protein folding problem. Annu. Rev. Biophys. 37, 289–316 (2008).
doi: 10.1146/annurev.biophys.37.092707.153558 pubmed: 18573083 pmcid: 2443096
Jones, D. T., Singh, T., Kosciolek, T. & Tetchner, S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31, 999–1006 (2015).
doi: 10.1093/bioinformatics/btu791 pubmed: 25431331
Golkov, V. et al. Protein contact prediction from amino acid co-evolution using convolutional networks for graph-valued images. In Advances in Neural Information Processing Systems (eds Lee, D. et al.) (Curran Associates, 2016).
Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
doi: 10.1371/journal.pcbi.1005324 pubmed: 28056090 pmcid: 5249242
Liu, Y., Palmedo, P., Ye, Q., Berger, B. & Peng, J. Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst. 6, 65–74 (2018).
doi: 10.1016/j.cels.2017.11.014 pubmed: 29275173
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
doi: 10.1038/s41586-019-1923-7 pubmed: 31942072
Xu, J., McPartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. 3, 601–609 (2021).
doi: 10.1038/s42256-021-00348-5 pubmed: 34368623 pmcid: 8340610
Šali, A. & Blundell, T. L. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815 (1993).
doi: 10.1006/jmbi.1993.1626 pubmed: 8254673
Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5, 725–738 (2010).
doi: 10.1038/nprot.2010.5 pubmed: 20360767 pmcid: 2849174
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
doi: 10.1038/s41586-021-03819-2
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
doi: 10.1038/s41592-022-01488-1 pubmed: 35637307 pmcid: 9184281
Baek, M. Adding a big enough number for ‘residue_index’ feature is enough to model hetero-complex using AlphaFold (green&cyan: crystal structure / magenta: predicted model w/ residue_index modification). Twitter twitter.com/minkbaek/status/1417538291709071362?lang=en (2021).
Tsaban, T. et al. Harnessing protein folding neural networks for peptide–protein docking. Nat. Commun. 13, 176 (2022).
doi: 10.1038/s41467-021-27838-9 pubmed: 35013344 pmcid: 8748686
Roney, J. P. & Ovchinnikov, S. State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101 (2022).
doi: 10.1103/PhysRevLett.129.238101 pubmed: 36563190
Baltzis, A. et al. Highly significant improvement of protein sequence alignments with AlphaFold2. Bioinformatics 38, 5007–5011 (2022).
Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein–protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).
doi: 10.1038/s41467-022-28865-w pubmed: 35273146 pmcid: 8913741
Wayment-Steele, H. K., Ovchinnikov, S., Colwell, L. & Kern, D. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. Nature 625, 832–839 (2024).
doi: 10.1038/s41586-023-06832-9 pubmed: 37956700
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
doi: 10.1038/s41586-021-03828-1 pubmed: 34293799 pmcid: 8387240
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2021).
doi: 10.1093/nar/gkab1061 pmcid: 8728224
Callaway, E. ‘The entire protein universe’: AI predicts shape of nearly every known protein. Nature 608, 15–16 (2022).
doi: 10.1038/d41586-022-02083-2 pubmed: 35902752
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034 (2021).
Ahdritz, G. et al. OpenProteinSet: training data for structural biology at scale. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 4597–4609 (Curran Associates, 2023).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 8026–8037 (Curran Associates, 2019).
Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs. GitHub github.com/google/jax (2018).
Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20 3505–3506 (Association for Computing Machinery, 2020).
Charlier, B., Feydy, J., Glaunès, J., Collin, F.-D. & Durif, G. Kernel operations on the GPU, with autodiff, without memory overflows. J. Mach. Learn. Res. 22, 1–6 (2021).
Falcon, W. & the PyTorch Lightning team. PyTorch Lightning (PyTorch Lightning, 2019).
Dao, T., Fu, D. Y., Ermon, S., Rudra, A. & Ré, C. FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 16344–16359 (Curran Associates, 2022).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
doi: 10.1093/nar/gkw1081 pubmed: 27899574
wwPDB Consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528 (2018).
doi: 10.1093/nar/gky949
Haas, J. et al. Continuous automated model evaluation (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86, 387–398 (2018).
doi: 10.1002/prot.25431 pubmed: 29178137
Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
doi: 10.1093/bioinformatics/btt473 pubmed: 23986568 pmcid: 3799472
Orengo, C. A. et al. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997).
doi: 10.1016/S0969-2126(97)00260-8 pubmed: 9309224
Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021).
doi: 10.1093/nar/gkaa1079 pubmed: 33237325
Andreeva, A., Kulesha, E., Gough, J. & Murzin, A. G. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 48, D376–D382 (2020).
doi: 10.1093/nar/gkz1064 pubmed: 31724711
Saitoh, Y. et al. Structural basis for high selectivity of a rice silicon channel Lsi1. Nat. Commun. 12, 6236 (2021).
doi: 10.1038/s41467-021-26535-x pubmed: 34716344 pmcid: 8556265
Mota, D. C. A. M. et al. Structural and thermodynamic analyses of human TMED1 (p24γ1) Golgi dynamics. Biochimie 192, 72–82 (2022).
doi: 10.1016/j.biochi.2021.10.002 pubmed: 34634369
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) (Curran Associates, 2017).
Rabe, M. N. & Staats, C. Self-attention does not need O(n²) memory. Preprint at https://doi.org/10.48550/arXiv.2112.05682 (2021).
Cheng, S. et al. FastFold: optimizing AlphaFold training and inference on GPU clusters. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming 417–430 (Association for Computing Machinery, 2024).
Li, Z. et al. Uni-Fold: an open-source platform for developing protein folding models beyond AlphaFold. Preprint at bioRxiv https://doi.org/10.1101/2022.08.04.502811 (2022).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370–3374 (2003).
doi: 10.1093/nar/gkg571 pubmed: 12824330 pmcid: 168977
Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011).
doi: 10.1371/journal.pone.0028766 pubmed: 22163331 pmcid: 3233603
Sułkowska, J. I., Morcos, F., Weigt, M., Hwa, T. & Onuchic, J. N. Genomics-aided structure prediction. Proc. Natl Acad. Sci. USA 109, 10340–10345 (2012).
doi: 10.1073/pnas.1207864109 pubmed: 22691493 pmcid: 3387073
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).
Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems (eds Oh, A. H. et al.) 30016–30030 (NeurIPS, 2022).
Tay, Y. et al. Scaling laws vs model architectures: how does inductive bias influence scaling? In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 12342–12364 (Association for Computational Linguistics, 2023).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
doi: 10.1126/science.ade2574 pubmed: 36927031
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
doi: 10.1038/s41592-019-0598-1 pubmed: 31636460 pmcid: 7067682
Chowdhury, R. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
Wu, R. et al. High-resolution de novo structure prediction from primary sequence. Preprint at bioRxiv https://doi.org/10.1101/2022.07.21.500999 (2022).
Singh, J., Paliwal, K., Litfin, T., Singh, J. & Zhou, Y. Predicting RNA distance-based contact maps by integrated deep learning on physics-inferred secondary structure and evolutionary-derived mutational coupling. Bioinformatics 38, 3900–3910 (2022).
doi: 10.1093/bioinformatics/btac421 pubmed: 35751593 pmcid: 9364379
Baek, M., McHugh, R., Anishchenko, I., Baker, D. & DiMaio, F. Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat. Methods 21, 117–121 (2024).
doi: 10.1038/s41592-023-02086-5 pubmed: 37996753
Pearce, R., Omenn, G. S. & Zhang, Y. De novo RNA tertiary structure prediction at atomic resolution using geometric potentials from deep learning. Preprint at bioRxiv https://doi.org/10.1101/2022.05.15.491755 (2022).
McPartlon, M., Lai, B. & Xu, J. A deep SE(3)-equivariant model for learning inverse protein folding. Preprint at bioRxiv https://doi.org/10.1101/2022.04.15.488492 (2022).
McPartlon, M. & Xu, J. An end-to-end deep learning method for protein side-chain packing and inverse folding. Proc. Natl Acad. Sci. USA 120, e2216438120 (2023).
Knox, H. L., Sinner, E. K., Townsend, C. A., Boal, A. K. & Booker, S. J. Structure of a B12-dependent radical SAM enzyme in carbapenem biosynthesis. Nature 602, 343–348 (2022).
doi: 10.1038/s41586-021-04392-4 pubmed: 35110734 pmcid: 8950224
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
doi: 10.1002/prot.20264 pubmed: 15476259
Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. ZeRO: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Press, 2020).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2015).
Wang, G. et al. HelixFold: an efficient implementation of AlphaFold2 using PaddlePaddle. Preprint at https://doi.org/10.48550/arXiv.2207.05477 (2022).
Yuan, J. et al. OneFlow: redesign the distributed deep learning framework from scratch. Preprint at https://doi.org/10.48550/arXiv.2110.15032 (2021).
Ovchinnikov, S. Weekend project! So now that OpenFold weights are available, I was curious how different they are from AlphaFold weights and if they can be used for AfDesign evaluation. More specifically, if you design a protein with AlphaFold, can OpenFold predict it (and vice-versa)? (1/5). Twitter twitter.com/sokrypton/status/1551242121528520704?lang=en (2022).
Wei, X. et al. The α-helical cap domain of a novel esterase from gut Alistipes shahii shaping the substrate-binding pocket. J. Agric. Food Chem. 69, 6064–6072 (2021).
doi: 10.1021/acs.jafc.1c00940 pubmed: 33979121
Carroll, B. L. et al. Caught in motion: human NTHL1 undergoes interdomain rearrangement necessary for catalysis. Nucleic Acids Res. 49, 13165–13178 (2021).
doi: 10.1093/nar/gkab1162 pubmed: 34871433 pmcid: 8682792

Authors

Gustaf Ahdritz (G)

Department of Systems Biology, Columbia University, New York, NY, USA.
Harvard University, Cambridge, MA, USA.

Nazim Bouatta (N)

Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA. nbouatta@gmail.com.

Christina Floristean (C)

Department of Systems Biology, Columbia University, New York, NY, USA.

Sachin Kadyan (S)

Department of Systems Biology, Columbia University, New York, NY, USA.

Qinghui Xia (Q)

Department of Systems Biology, Columbia University, New York, NY, USA.

William Gerecke (W)

Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.

Timothy J O'Donnell (TJ)

Icahn School of Medicine at Mount Sinai, New York, NY, USA.

Daniel Berenberg (D)

Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.

Ian Fisk (I)

Flatiron Institute, New York, NY, USA.

Niccolò Zanichelli (N)

OpenBioML, Cambridge, MA, USA.

Bo Zhang (B)

Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA.

Arkadiusz Nowaczynski (A)

NVIDIA, Santa Clara, CA, USA.

Bei Wang (B)

NVIDIA, Santa Clara, CA, USA.

Marta M Stepniewska-Dziubinska (MM)

NVIDIA, Santa Clara, CA, USA.

Shang Zhang (S)

NVIDIA, Santa Clara, CA, USA.

Murat Efe Guney (ME)

NVIDIA, Santa Clara, CA, USA.

Stella Biderman (S)

EleutherAI, New York, NY, USA.
Booz Allen Hamilton, McLean, VA, USA.

Andrew M Watkins (AM)

Prescient Design, Genentech, New York, NY, USA.

Stephen Ra (S)

Prescient Design, Genentech, New York, NY, USA.

Pablo Ribalta Lorenzo (PR)

NVIDIA, Santa Clara, CA, USA.

Lucas Nivon (L)

Cyrus Bio, Seattle, WA, USA.

Brian Weitzner (B)

Outpace Bio, Seattle, WA, USA.

Yih-En Andrew Ban (YA)

Arzeda, Seattle, WA, USA.

Shiyang Chen (S)

Rutgers University, New Brunswick, NJ, USA.

Minjia Zhang (M)

University of Illinois at Urbana-Champaign, Champaign, IL, USA.

Conglong Li (C)

Microsoft, Redmond, WA, USA.

Shuaiwen Leon Song (SL)

Microsoft, Redmond, WA, USA.

Yuxiong He (Y)

Microsoft, Redmond, WA, USA.

Peter K Sorger (PK)

Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.

Emad Mostaque (E)

Stability AI, Los Altos, CA, USA.

Zhao Zhang (Z)

Rutgers University, New Brunswick, NJ, USA.

Richard Bonneau (R)

Prescient Design, Genentech, New York, NY, USA.

Mohammed AlQuraishi (M)

Department of Systems Biology, Columbia University, New York, NY, USA. m.alquraishi@columbia.edu.

MeSH classifications