Efficient generative modeling of protein sequences using simple autoregressive models.
Journal
Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date: 2021-10-04
History:
received: 2021-02-24
accepted: 2021-08-23
entrez: 2021-10-05
pubmed: 2021-10-06
medline: 2022-04-03
Status:
epublish
Abstract
Generative models are emerging as promising candidates for novel sequence-data-driven approaches to protein design, and for extracting structural and functional information about proteins that is deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate yet computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10
Identifiers
pubmed: 34608136
doi: 10.1038/s41467-021-25756-4
pii: 10.1038/s41467-021-25756-4
pmc: PMC8490405
Chemical substances
Proteins
0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
5800
Comments and corrections
Type: ErratumIn
Copyright information
© 2021. The Author(s).