Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods.
MCDA
TOPSIS
embeddings
ensemble learning
imbalanced assay-labeled datasets
machine learning
protein fitness prediction
sampling methods
sequence representation
Journal
Pharmaceutics
ISSN: 1999-4923
Abbreviated title: Pharmaceutics
Country: Switzerland
NLM ID: 101534003
Publication information
Publication date:
25 Apr 2023
History:
received: 24 Feb 2023
revised: 19 Apr 2023
accepted: 21 Apr 2023
medline: 27 May 2023
pubmed: 27 May 2023
entrez: 27 May 2023
Status:
epublish
Abstract
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). We examine how performance varies with protein fitness, protein size, and sampling technique. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%).
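The ranking step named in the abstract (TOPSIS with entropy weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate labels, and every metric value below are hypothetical, and all criteria are assumed to be positive "higher is better" scores such as F1.

```python
import math

def entropy_weights(matrix):
    """Entropy weighting: criteria whose values vary more across
    alternatives receive larger weights."""
    m, n = len(matrix), len(matrix[0])
    diversities = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        diversities.append(1.0 - entropy)
    total_div = sum(diversities)
    if total_div <= 1e-12:  # all criteria uninformative: fall back to equal weights
        return [1.0 / n] * n
    return [d / total_div for d in diversities]

def topsis(matrix, weights):
    """Closeness of each alternative to the ideal solution (1 = best).
    Assumes every criterion is a benefit criterion."""
    n = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores

# Hypothetical candidates and made-up metric values (three criteria per candidate).
candidates = ["A", "B", "C"]
metrics = [
    [0.90, 0.80, 0.85],
    [0.95, 0.88, 0.92],
    [0.70, 0.55, 0.65],
]

weights = entropy_weights(metrics)
closeness = topsis(metrics, weights)
ranking = [name for _, name in sorted(zip(closeness, candidates), reverse=True)]
print(ranking)
```

In this toy input, candidate B scores highest on every criterion, so it coincides with the ideal solution and ranks first. With real metrics, where no candidate dominates, the entropy weights decide how much each criterion contributes to the ranking.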
Identifiers
pubmed: 37242577
pii: pharmaceutics15051337
doi: 10.3390/pharmaceutics15051337
pmc: PMC10224321
Publication types
Journal Article
Languages
eng
Grants
Agency: Michigan State University
ID: N/A