Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods.

Keywords: MCDA TOPSIS; embeddings; ensemble learning; imbalanced assay-labeled datasets; machine learning; protein fitness prediction; sampling methods; sequence representation

Journal

Pharmaceutics

ISSN: 1999-4923

Abbreviated title: Pharmaceutics

Country: Switzerland

NLM ID: 101534003

Publication information

Publication date:
25 Apr 2023

History:

received: 24 Feb 2023

revised: 19 Apr 2023

accepted: 21 Apr 2023

medline: 27 May 2023

pubmed: 27 May 2023

entrez: 27 May 2023

Status: epublish

Abstract

Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the immense and rugged protein fitness landscape. Despite this potential, guidance is still needed for training and evaluating ML methods on sequencing data. Two key challenges in training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML to assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). We analyze how performance varies with protein fitness, protein size, and sampling technique. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling when sequences were encoded with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone performed well enough in stability prediction (F1-score = 92%).
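
The ranking step named in the abstract (TOPSIS with entropy weighting) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the three metric columns, and all numeric values below are made up for illustration, and every criterion is assumed to be a benefit criterion (higher is better) with strictly positive values.

```python
import math

def entropy_weights(matrix):
    """Entropy weights for a decision matrix (rows = alternatives, columns = criteria).

    Assumes at least two rows and strictly positive entries.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    k = 1.0 / math.log(n_rows)
    diversities = []
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [value / total for value in column]
        entropy = -k * sum(p * math.log(p) for p in probs if p > 0)
        # Criteria whose values differ more across alternatives get more weight.
        diversities.append(1.0 - entropy)
    diversity_sum = sum(diversities)
    return [d / diversity_sum for d in diversities]

def topsis(matrix, weights):
    """Closeness of each alternative to the ideal solution (1 = best, 0 = worst)."""
    n_cols = len(matrix[0])
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    ideal = [max(row[j] for row in weighted) for j in range(n_cols)]
    anti_ideal = [min(row[j] for row in weighted) for j in range(n_cols)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti_ideal)
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical metric table: three candidate methods scored on three
# illustrative metrics. These numbers are NOT results from the paper.
names = ["method_a", "method_b", "method_c"]
metrics = [
    [0.90, 0.85, 0.80],
    [0.70, 0.75, 0.60],
    [0.95, 0.90, 0.88],
]

weights = entropy_weights(metrics)
closeness = topsis(metrics, weights)
ranking = [name for _, name in sorted(zip(closeness, names), reverse=True)]
print(ranking)
```

In this toy table one method is best on every metric and another is worst on every metric, so they receive closeness scores of 1 and 0 respectively; with real, conflicting metrics the entropy weights decide how the trade-offs are resolved.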

Identifiers

DOI: 10.3390/pharmaceutics15051337

PMID: 37242577

PMC: PMC10224321

pii: pharmaceutics15051337

Publication types

Journal Article

Languages

eng

Grants

Agency: Michigan State University

ID: N/A

Authors

Mehrsa Mardikoraem

Daniel Woldring
