Efficient generative modeling of protein sequences using simple autoregressive models.


Journal

Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
4 October 2021
History:
received: 24 February 2021
accepted: 23 August 2021
entrez: 5 October 2021
pubmed: 6 October 2021
medline: 3 April 2022
Status: epublish

Abstract

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10

Identifiers

pubmed: 34608136
doi: 10.1038/s41467-021-25756-4
pii: 10.1038/s41467-021-25756-4
pmc: PMC8490405

Chemical substances

Proteins 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

5800

Comments and corrections

Type: ErratumIn

Copyright information

© 2021. The Author(s).


Authors

Jeanne Trinquier (J)

Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France.
Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France.

Guido Uguzzoni (G)

Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129, Torino, Italy.
Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo (TO), Italy.

Andrea Pagnani (A)

Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129, Torino, Italy.
Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo (TO), Italy.
INFN Sezione di Torino, Via P. Giuria 1, I-10125, Torino, Italy.

Francesco Zamponi (F)

Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France.

Martin Weigt (M)

Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France. martin.weigt@sorbonne-universite.fr.
