Generative models for protein sequence modeling: recent advances and future directions.
diffusion models
generative adversarial neural networks (GANs)
generative machine learning (ML) models
natural language processing (NLP)
protein engineering
variational autoencoders (VAE)
Journal
Briefings in bioinformatics
ISSN: 1477-4054
Abbreviated title: Brief Bioinform
Country: England
NLM ID: 100912837
Publication information
Publication date:
22 09 2023
History:
received: 13 06 2023
revised: 08 09 2023
accepted: 12 09 2023
medline: 2 11 2023
pubmed: 21 10 2023
entrez: 21 10 2023
Status:
ppublish
Abstract
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) offering guidance on how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and a robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Identifiers
pubmed: 37864295
pii: 7325909
doi: 10.1093/bib/bbad358
pmc: PMC10589401
Chemical substances
Proteins
0
Publication types
Journal Article
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2023. Published by Oxford University Press.