Learning protein fitness models from evolutionary and assay-labeled data.


Journal

Nature biotechnology
ISSN: 1546-1696
Abbreviated title: Nat Biotechnol
Country: United States
NLM ID: 9604648

Publication information

Publication date:
07 2022
History:
received: 09 04 2021
accepted: 02 11 2021
pubmed: 19 1 2022
medline: 20 7 2022
entrez: 18 1 2022
Status: ppublish

Abstract

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
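
The combination described in the abstract can be illustrated with a minimal sketch: one-hot (site-specific) amino acid features are concatenated with a single density score per sequence, and a ridge regression is fitted on the result. This is not the authors' implementation. The function names, the toy sequences, the density scores, the fitness labels, and the single shared regularization strength `alpha` are all assumptions made for illustration; in practice the density score would come from an evolutionary model such as the variational autoencoder mentioned above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Site-specific features: one indicator per (position, amino acid)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()


def build_features(seqs, density_scores):
    """Concatenate one-hot features with one density feature per sequence."""
    site = np.stack([one_hot(s) for s in seqs])
    dens = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([site, dens])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    return X @ w + b


# Toy data (hypothetical): aligned variants of equal length, made-up
# density scores standing in for an evolutionary model's log-likelihood,
# and made-up assay labels.
seqs = ["ACDE", "ACDF", "GCDE", "ACHE"]
density = [-1.2, -3.4, -2.0, -2.8]
y = np.array([1.0, 0.2, 0.7, 0.4])

X = build_features(seqs, density)
w, b = fit_ridge(X, y, alpha=1.0)
preds = predict(X, w, b)
```

The sketch applies the same penalty to every feature for simplicity; whether the density feature should be regularized differently from the one-hot features is a design choice this example does not settle.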

Identifiers

pubmed: 35039677
doi: 10.1038/s41587-021-01146-5
pii: 10.1038/s41587-021-01146-5

Chemical substances

Proteins 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

1114-1122

Grants

Agency: NLM NIH HHS
ID: T32 LM012417
Country: United States

Copyright information

© 2022. The Author(s), under exclusive licence to Springer Nature America, Inc.

Authors

Chloe Hsu (C)

Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA. chloehsu@berkeley.edu.

Hunter Nisonoff (H)

Center for Computational Biology, University of California, Berkeley, USA.

Clara Fannjiang (C)

Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA.

Jennifer Listgarten (J)

Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA. jennl@berkeley.edu.
Center for Computational Biology, University of California, Berkeley, USA. jennl@berkeley.edu.
