Conformal prediction under feedback covariate shift for biomolecular design.
conformal prediction
machine learning
protein engineering
uncertainty quantification
Journal
Proceedings of the National Academy of Sciences of the United States of America
ISSN: 1091-6490
Abbreviated title: Proc Natl Acad Sci U S A
Country: United States
NLM ID: 7505876
Publication information
Publication date:
25 10 2022
History:
entrez: 18 10 2022
pubmed: 19 10 2022
medline: 21 10 2022
Status:
ppublish
Abstract
Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and the model's outputs are then used to choose what data to consider next. For example, a data-driven approach to designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Because validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting: one in which the training and test data are statistically dependent, as the latter are chosen based on the former. Consequently, the model's error on the test data (that is, the designed sequences) has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.
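For orientation, the sketch below shows standard split conformal prediction for regression, the exchangeable-data baseline that methods of this kind build on. It is not the method introduced in the paper: it assumes calibration and test points are exchangeable, which is exactly what fails when the test inputs are chosen using the trained model. The function name `split_conformal_interval`, its parameters, and the use of absolute residuals as scores are illustrative assumptions, not taken from the record.

```python
import numpy as np


def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Return (lower, upper) around one prediction.

    Hypothetical helper: standard split conformal prediction with
    absolute-residual scores. Assumes exchangeable calibration and test
    data, so it does NOT cover the feedback covariate shift setting.
    """
    # Nonconformity scores: absolute errors on held-out calibration data.
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = scores.size
    # Rank of the conformal quantile, 1-indexed: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: interval is unbounded.
        return (-np.inf, np.inf)
    q = scores[k - 1]
    return (prediction - q, prediction + q)
```

With nine calibration residuals and `alpha=0.2`, the rank is `ceil(10 * 0.8) = 8`, so the half-width is the eighth-smallest absolute residual. Under the design setting described in the abstract, this unweighted quantile carries no coverage guarantee, which is the gap the paper's confidence sets are meant to address.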
Identifiers
pubmed: 36256807
doi: 10.1073/pnas.2204569119
pmc: PMC9618043
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
e2204569119