Protein structure generation via folding diffusion.


Journal

Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
05 Feb 2024
History:
received: 18 07 2023
accepted: 12 01 2024
medline: 6 2 2024
pubmed: 6 2 2024
entrez: 5 2 2024
Status: epublish

Abstract

The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments for currently incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a diffusion-based generative model that generates protein backbone structures via a procedure inspired by the natural folding process. We describe a protein backbone structure as a sequence of angles capturing the relative orientation of the constituent backbone atoms, and generate structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins natively twist into energetically favorable conformations, but the inherent shift and rotational invariance of this representation also crucially alleviates the need for more complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally occurring proteins. As a useful resource, we release an open-source codebase and trained models for protein structure diffusion.
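
The shift and rotational invariance mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the authors' released codebase: the function names (`dihedral`, `backbone_dihedrals`, `random_rotation`) are hypothetical, the coordinates are random points rather than a real protein, and using only dihedral angles is an assumption made for brevity, since the abstract does not list which angles the model uses.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four 3D points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

def backbone_dihedrals(coords):
    """Dihedrals over every run of four consecutive points in an (N, 3) array."""
    return np.array([dihedral(*coords[i:i + 4]) for i in range(len(coords) - 3)])

def random_rotation(rng):
    """Random proper rotation (determinant +1) built from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # make the factorization sign-consistent
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]       # flip one axis to remove a reflection
    return q

# Toy "backbone": random points stand in for atom coordinates (assumption).
rng = np.random.default_rng(0)
coords = rng.normal(size=(12, 3))

# Apply an arbitrary rigid motion: rotate, then translate.
rot = random_rotation(rng)
moved = coords @ rot.T + rng.normal(size=3)

angles_before = backbone_dihedrals(coords)
angles_after = backbone_dihedrals(moved)

# Compare on the circle so that values near +pi and -pi count as equal.
max_diff = np.max(np.abs(np.angle(np.exp(1j * (angles_before - angles_after)))))
print("largest angular difference after rigid motion:", max_diff)
```

Because the angles are unchanged by the rigid motion, a model that operates on them directly does not need to be told where the structure sits in space or how it is oriented, which is the property the abstract credits for avoiding equivariant architectures.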

Identifiers

pubmed: 38316764
doi: 10.1038/s41467-024-45051-2
pii: 10.1038/s41467-024-45051-2

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

1059

Copyright information

© 2024. The Author(s).

Authors

Kevin E Wu (KE)

Department of Computer Science, Stanford University, Stanford, CA, USA.
Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA.
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA.

Kevin K Yang (KK)

Microsoft Research, Cambridge, MA, USA.

Rianne van den Berg (R)

Microsoft Research, Amsterdam, Netherlands.

Sarah Alamdari (S)

Microsoft Research, Cambridge, MA, USA.

James Y Zou (JY)

Department of Computer Science, Stanford University, Stanford, CA, USA.
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA.

Alex X Lu (AX)

Microsoft Research, Cambridge, MA, USA.

Ava P Amini (AP)

Microsoft Research, Cambridge, MA, USA. ava.amini@microsoft.com.

MeSH classifications