Protein structure generation via folding diffusion.
Journal
Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date:
05 Feb 2024
History:
received: 18 Jul 2023
accepted: 12 Jan 2024
medline: 6 Feb 2024
pubmed: 6 Feb 2024
entrez: 5 Feb 2024
Status:
epublish
Abstract
The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a diffusion-based generative model that generates protein backbone structures via a procedure inspired by the natural folding process. We describe a protein backbone structure as a sequence of angles capturing the relative orientation of the constituent backbone atoms, and generate structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins natively twist into energetically favorable conformations, the inherent shift and rotational invariance of this representation crucially alleviates the need for more complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release an open-source codebase and trained models for protein structure diffusion.
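The two ideas in the abstract, an angle-based backbone representation that is unchanged by shifts and rotations, and noising of those angles toward a random state, can be illustrated with a minimal sketch. This is not the authors' released code. The helper names, the linear noise schedule (the default from Ho et al.'s DDPM paper), the wrapping of noised angles into [-pi, pi), and the toy coordinates are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
import random


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four consecutive points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.atan2(y, x)


def rigid_transform(p, theta, shift):
    """Rotate a point about the z-axis by theta, then translate by shift."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])


def wrap(angle):
    """Map an angle into [-pi, pi), up to floating-point rounding."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_noise(angles, alpha_bar, rng):
    """Standard DDPM forward step applied to angles, then wrapped (assumed)."""
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [wrap(scale * a + sigma * rng.gauss(0.0, 1.0)) for a in angles]


# Toy coordinates (hypothetical, not a real backbone fragment).
points = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.5), (0.7, 0.9, 2.0)]
moved = [rigid_transform(p, 1.1, (5.0, -3.0, 2.0)) for p in points]
angle_before = dihedral(*points)
angle_after = dihedral(*moved)  # same angle after rotating and shifting

rng = random.Random(0)
schedule = alpha_bars(1000)
clean = [-1.0, 2.5, 0.3]
noisy = forward_noise(clean, schedule[-1], rng)  # close to pure noise
```

Because `angle_before` and `angle_after` agree, a model operating on such angles never sees the global pose of the structure, which is the property the abstract credits with removing the need for equivariant networks. Generation would run the learned reverse of `forward_noise`, which this sketch does not attempt.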
Identifiers
pubmed: 38316764
doi: 10.1038/s41467-024-45051-2
pii: 10.1038/s41467-024-45051-2
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
1059
Copyright information
© 2024. The Author(s).