Machine Learning-driven Protein Library Design: A Path Toward Smarter Libraries.
Deep learning
Directed evolution
Library design
Machine learning
Journal
Methods in molecular biology (Clifton, N.J.)
ISSN: 1940-6029
Abbreviated title: Methods Mol Biol
Country: United States
NLM ID: 9214969
Publication information
Publication date:
2022
History:
entrez: 2022-04-28
pubmed: 2022-04-29
medline: 2022-05-03
Status:
ppublish
Abstract
Proteins are small yet valuable biomolecules that play versatile roles in therapeutics and diagnostics. The intricate sequence-structure-function paradigm of proteins opens the possibility of mapping amino acid sequence directly to function. However, the rugged nature of the protein fitness landscape and the astronomical number of possible mutations, even for small proteins, make navigating this system a daunting task. Moreover, the scarcity of functional proteins and the ease with which deleterious mutations are introduced, owing to complex epistatic relationships, compound these challenges. This highlights the need for auxiliary tools to complement current techniques such as rational design and directed evolution. To that end, state-of-the-art machine learning can offer time and cost efficiency in finding high-fitness proteins by avoiding unnecessary wet-lab experiments. In the context of library design, machine learning provides valuable insights through features such as strong adaptability to complex systems, multitasking, parallelism, and the ability to capture hidden trends in input data. Finally, advances in computational resources and the rapidly increasing number of sequences in protein databases will allow machine learning to deliver more promising and detailed insights for protein library design. This chapter discusses fundamental concepts and a method for machine learning-driven library design that leverages deep sequencing datasets. We elaborate on (1) basic knowledge of machine learning algorithms, (2) the benefits of machine learning in library design, and (3) a methodology for implementing machine learning in library design.
Identifiers
pubmed: 35482186
doi: 10.1007/978-1-0716-2285-8_5
Chemical substances
Proteins
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
87-104
Copyright information
© 2022. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.