Discovery of sparse, reliable omic biomarkers with Stabl.
Journal
Nature biotechnology
ISSN: 1546-1696
Abbreviated title: Nat Biotechnol
Country: United States
NLM ID: 9604648
Publication information
Publication date:
02 Jan 2024
History:
received: 20 Feb 2023
accepted: 16 Oct 2023
medline: 04 Jan 2024
pubmed: 04 Jan 2024
entrez: 03 Jan 2024
Status:
aheadofprint
Abstract
Adoption of high-content omic technologies in clinical studies, coupled with computational methods, has yielded an abundance of candidate biomarkers. However, translating such findings into bona fide clinical biomarkers remains challenging. To facilitate this process, we introduce Stabl, a general machine learning method that identifies a sparse, reliable set of biomarkers by integrating noise injection and a data-driven signal-to-noise threshold into multivariable predictive modeling. Evaluation of Stabl on synthetic datasets and five independent clinical studies demonstrates improved biomarker sparsity and reliability compared to commonly used sparsity-promoting regularization methods while maintaining predictive performance; it distills datasets containing 1,400–35,000 features down to 4–34 candidate biomarkers. Stabl extends to multi-omic integration tasks, enabling biological interpretation of complex predictive models, as it homes in on a shortlist of proteomic, metabolomic and cytometric events predicting labor onset, microbial biomarkers of pre-term birth and a pre-operative immune signature of post-surgical infections. Stabl is available at https://github.com/gregbellan/Stabl.
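The abstract names two ingredients, noise injection and a data-driven signal-to-noise threshold, without giving details. The following is only a minimal sketch of how such a scheme could look, not the Stabl package or its API. It assumes that "noise" means permuted copies of the real features, that reliability is measured by how often a lasso selects a feature across random subsamples at a single penalty `lam`, and that the threshold minimises a simple false-discovery surrogate. All function names, the NumPy-only ISTA solver, and every default value are hypothetical choices made for illustration.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 with plain ISTA."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y) / n)
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b


def selection_frequencies(X, y, lam, n_boot=50, frac=0.5, seed=0):
    """Selection frequency of real and decoy features over random subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Assumed form of noise injection: one independently permuted copy per feature.
    decoys = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    X_aug = np.hstack([X, decoys])
    counts = np.zeros(2 * p)
    m = int(frac * n)
    for _ in range(n_boot):
        idx = rng.choice(n, size=m, replace=False)
        Xs = X_aug[idx]
        Xs = (Xs - Xs.mean(axis=0)) / (Xs.std(axis=0) + 1e-12)
        ys = y[idx] - y[idx].mean()
        counts += (lasso_ista(Xs, ys, lam) != 0)
    freq = counts / n_boot
    return freq[:p], freq[p:]


def data_driven_threshold(freq_real, freq_decoy, grid=None):
    """Pick the frequency cut-off that minimises a false-discovery surrogate."""
    if grid is None:
        grid = np.linspace(0.3, 1.0, 71)
    # Surrogate: (1 + decoys kept) / max(1, real features kept) at each cut-off.
    fdp = [
        (1 + np.sum(freq_decoy >= t)) / max(1, np.sum(freq_real >= t))
        for t in grid
    ]
    return float(grid[int(np.argmin(fdp))])


def select_features(X, y, lam=0.3, **kwargs):
    """Return indices of real features whose frequency clears the threshold."""
    freq_real, freq_decoy = selection_frequencies(X, y, lam, **kwargs)
    t = data_driven_threshold(freq_real, freq_decoy)
    return np.flatnonzero(freq_real >= t), t


# Toy usage on synthetic data: only the first three features carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 30))
y = 3 * X[:, 0] - 3 * X[:, 1] + 3 * X[:, 2] + rng.normal(size=100)
selected, threshold = select_features(X, y)
print(selected, threshold)
```

In this sketch the decoys play the role of a known-null reference: a cut-off that lets many decoys through is presumably also letting uninformative real features through, so the threshold is set from the data rather than fixed in advance. The published method may differ in how noise is generated, how penalties are scanned, and how the threshold is defined.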
Identifiers
pubmed: 38168992
doi: 10.1038/s41587-023-02033-x
pii: 10.1038/s41587-023-02033-x
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Copyright information
© 2024. The Author(s).
References
Subramanian, I., Verma, S., Kumar, S., Jere, A. & Anamika, K. Multi-omics data integration, interpretation, and its application. Bioinform. Biol. Insights 14, 1177932219899051 (2020).
pubmed: 32076369
pmcid: 7003173
doi: 10.1177/1177932219899051
Wafi, A. & Mirnezami, R. Translational -omics: future potential and current challenges in precision medicine. Methods 151, 3–11 (2018).
pubmed: 29792918
doi: 10.1016/j.ymeth.2018.05.009
Jackson, H. W. et al. The single-cell pathology landscape of breast cancer. Nature 578, 615–620 (2020).
pubmed: 31959985
doi: 10.1038/s41586-019-1876-x
Fourati, S. et al. Pan-vaccine analysis reveals innate immune endotypes predictive of antibody responses to vaccination. Nat. Immunol. 23, 1777–1787 (2022).
pubmed: 36316476
pmcid: 9747610
doi: 10.1038/s41590-022-01329-5
Dunkler, D., Sánchez-Cabo, F. & Heinze, G. Statistical analysis principles for omics data. Methods Mol. Biol. 719, 113–131 (2011).
pubmed: 21370081
doi: 10.1007/978-1-61779-027-0_5
Ghosh, D. & Poisson, L. M. ‘omics’ data and levels of evidence for biomarker discovery. Genomics 93, 13–16 (2009).
pubmed: 18723089
doi: 10.1016/j.ygeno.2008.07.006
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Methodol. 58, 267–288 (1996).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 67, 301–320 (2005).
doi: 10.1111/j.1467-9868.2005.00503.x
Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006).
doi: 10.1198/016214506000000735
Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. A sparse-group lasso. J. Comput. Graph. Stat. 22, 231–245 (2013).
doi: 10.1080/10618600.2012.681250
Ding, D. Y., Li, S., Narasimhan, B. & Tibshirani, R. Cooperative learning for multiview analysis. Proc. Natl Acad. Sci. USA 119, e2202113119 (2022).
pubmed: 36095183
pmcid: 9499553
doi: 10.1073/pnas.2202113119
Yang, P., Yang, J., Zhou, B. & Zomaya, A. A review of ensemble methods in bioinformatics. Curr. Bioinform. 5, 296–308 (2010).
doi: 10.2174/157489310794072508
Huan, X., Caramanis, C. & Mannor, S. Sparse algorithms are not stable: a no-free-lunch theorem. IEEE Trans. Pattern Anal. Mach. Intell. 34, 187–193 (2012).
doi: 10.1109/TPAMI.2011.177
Roberts, S. & Nowak, G. Stabilizing the lasso against cross-validation variability. Comput. Stat. Data Anal. 70, 198–211 (2014).
doi: 10.1016/j.csda.2013.09.008
Homrighausen, D. & McDonald, D. The lasso, persistence, and cross-validation. Proc. of the 30th International Conference on Machine Learning 2068–2076 (PMLR, 2013).
Olivier, M., Asmis, R., Hawkins, G. A., Howard, T. D. & Cox, L. A. The need for multi-omics biomarker signatures in precision medicine. Int. J. Mol. Sci. 20, 4781 (2019).
pubmed: 31561483
pmcid: 6801754
doi: 10.3390/ijms20194781
Tarazona, S., Arzalluz-Luque, A. & Conesa, A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nat. Comput. Sci. 1, 395–402 (2021).
doi: 10.1038/s43588-021-00086-z
Meinshausen, N. & Bühlmann, P. Stability selection. J. R. Stat. Soc. Series B Stat. Methodol. 72, 417–473 (2010).
doi: 10.1111/j.1467-9868.2010.00740.x
Candès, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. Series B Stat. Methodol. 80, 551–577 (2018).
doi: 10.1111/rssb.12265
Bach, F. Bolasso: model consistent lasso estimation through the bootstrap. Proc. of the 25th International Conference on Machine Learning 33–40 (PMLR, 2008).
Barber, R. F. & Candès, E. J. Controlling the false discovery rate via knockoffs. Ann. Stat. 43, 2055–2085 (2015).
doi: 10.1214/15-AOS1337
Ren, Z., Wei, Y. & Candès, E. Derandomizing knockoffs. J. Am. Stat. Assoc. 118, 948–958 (2023).
doi: 10.1080/01621459.2021.1962720
Weinstein, A., Barber, R. & Candès, E. A power and prediction analysis for knockoffs with lasso statistics. Preprint at https://doi.org/10.48550/arXiv.1712.06465 (2017).
Bondell, H. D. & Reich, B. J. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 64, 115–123 (2008).
pubmed: 17608783
doi: 10.1111/j.1541-0420.2007.00843.x
Bates, S., Candès, E., Janson, L. & Wang, W. Metropolized knockoff sampling. J. Am. Stat. Assoc. 116, 1413–1427 (2020).
doi: 10.1080/01621459.2020.1729163
Moufarrej, M. N. et al. Early prediction of preeclampsia in pregnancy with cell-free RNA. Nature 602, 689–694 (2022).
pubmed: 35140405
pmcid: 8971130
doi: 10.1038/s41586-022-04410-z
Marić, I. et al. Early prediction and longitudinal modeling of preeclampsia from multiomics. Patterns (N Y) 3, 100655 (2022).
pubmed: 36569558
doi: 10.1016/j.patter.2022.100655
Filbin, M. R. et al. Longitudinal proteomic analysis of severe COVID-19 reveals survival-associated signatures, tissue-specific cell death, and cell–cell interactions. Cell Rep. Med. 2, 100287 (2021).
pubmed: 33969320
pmcid: 8091031
doi: 10.1016/j.xcrm.2021.100287
Feyaerts, D. et al. Integrated plasma proteomic and single-cell immune signaling network signatures demarcate mild, moderate, and severe COVID-19. Cell Rep. Med. 3, 100680 (2022).
pubmed: 35839768
pmcid: 9238057
doi: 10.1016/j.xcrm.2022.100680
Hosmer, D. & Lemeshow, S. Applied Logistic Regression 376–383 (Wiley, 2000).
Davis, K. D. et al. Discovery and validation of biomarkers to aid the development of safe and effective pain therapeutics: challenges and opportunities. Nat. Rev. Neurol. 16, 381–400 (2020).
pubmed: 32541893
pmcid: 7326705
doi: 10.1038/s41582-020-0362-2
Kasten, M. & Giordano, A. Cdk10, a Cdc2-related kinase, associates with the Ets2 transcription factor and modulates its transactivation activity. Oncogene 20, 1832–1838 (2001).
pubmed: 11313931
doi: 10.1038/sj.onc.1204295
Markovic, S. S. et al. Galectin-1 as the new player in staging and prognosis of COVID-19. Sci. Rep. 12, 1272 (2022).
pubmed: 35075140
pmcid: 8786829
doi: 10.1038/s41598-021-04602-z
COvid-19 Multi-omics Blood ATlas (COMBAT) Consortium. A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell 185, 916–938 (2022).
doi: 10.1016/j.cell.2022.01.012
Mayr, C. H. et al. Integrative analysis of cell state changes in lung fibrosis with peripheral protein biomarkers. EMBO Mol. Med. 13, e12871 (2021).
pubmed: 33650774
pmcid: 8033531
doi: 10.15252/emmm.202012871
Overmyer, K. A. et al. Large-scale multi-omic analysis of COVID-19 severity. Cell Syst. 12, 23–40 (2021).
Mohammed, Y. et al. Longitudinal plasma proteomics analysis reveals novel candidate biomarkers in acute COVID-19. J. Proteome Res. 21, 975–992 (2022).
pubmed: 35143212
doi: 10.1021/acs.jproteome.1c00863
Stelzer, I. A. et al. Integrated trajectories of the maternal metabolome, proteome, and immunome predict labor onset. Sci. Transl. Med. 13, eabd9898 (2021).
pubmed: 33952678
pmcid: 8136601
doi: 10.1126/scitranslmed.abd9898
Suff, N., Story, L. & Shennan, A. The prediction of preterm delivery: what is new? Semin. Fetal Neonatal Med. 24, 27–32 (2019).
pubmed: 30337215
doi: 10.1016/j.siny.2018.09.006
Marquette, G. P., Hutcheon, J. A. & Lee, L. Predicting the spontaneous onset of labour in post-date pregnancies: a population-based retrospective cohort study. J. Obstet. Gynaecol. Can. 36, 391–399 (2014).
pubmed: 24927290
doi: 10.1016/S1701-2163(15)30584-3
Shah, N. et al. Changes in T cell and dendritic cell phenotype from mid to late pregnancy are indicative of a shift from immune tolerance to immune activation. Front. Immunol. 8, 1138 (2017).
pubmed: 28966619
pmcid: 5605754
doi: 10.3389/fimmu.2017.01138
Kraus, T. A. et al. Characterizing the pregnancy immune phenotype: results of the viral immunity and pregnancy (VIP) study. J. Clin. Immunol. 32, 300–311 (2012).
pubmed: 22198680
doi: 10.1007/s10875-011-9627-2
Shah, N. M., Lai, P. F., Imami, N. & Johnson, M. R. Progesterone-related immune modulation of pregnancy and labor. Front. Endocrinol. 10, 198 (2019).
doi: 10.3389/fendo.2019.00198
Brinkman-Van der Linden, E. C. M. et al. Human-specific expression of Siglec-6 in the placenta. Glycobiology 17, 922–931 (2007).
pubmed: 17580316
doi: 10.1093/glycob/cwm065
Kappou, D., Sifakis, S., Konstantinidou, A., Papantoniou, N. & Spandidos, D. A. Role of the angiopoietin/tie system in pregnancy (Review). Exp. Ther. Med. 9, 1091–1096 (2015).
pubmed: 25780392
pmcid: 4353758
doi: 10.3892/etm.2015.2280
Huang, B. et al. Interleukin-33-induced expression of PIBF1 by decidual B cells protects against preterm labor. Nat. Med. 23, 128–135 (2017).
pubmed: 27918564
doi: 10.1038/nm.4244
Li, A., Lee, R. H., Felix, J. C., Minoo, P. & Goodwin, T. M. Alteration of secretory leukocyte protease inhibitor in human myometrium during labor. Am. J. Obstet. Gynecol. 200, 311.e1–311.e10 (2009).
pubmed: 19254589
doi: 10.1016/j.ajog.2008.10.045
Golob, J. L. et al. Microbiome preterm birth dream challenge: crowdsourcing machine learning approaches to advance preterm birth research. Preprint at medRxiv https://doi.org/10.1101/2023.03.07.23286920 (2023).
Minot, S. S. et al. Robust harmonization of microbiome studies by phylogenetic scaffolding with MaLiAmPi. Cell Rep. Methods 3, 100639 (2023).
pubmed: 37939711
pmcid: 10694490
doi: 10.1016/j.crmeth.2023.100639
Tosato, G. & Jones, K. D. Interleukin-1 induces interleukin-6 production in peripheral blood monocytes. Blood 75, 1305–1310 (1990).
pubmed: 2310829
doi: 10.1182/blood.V75.6.1305.1305
Lee, J.-K. et al. Differences in signaling pathways by IL-1β and IL-18. Proc. Natl Acad. Sci. USA 101, 8815–8820 (2004).
pubmed: 15161979
pmcid: 423278
doi: 10.1073/pnas.0402800101
Fong, T. G. et al. Identification of plasma proteome signatures associated with surgery using SOMAscan. Ann. Surg. 273, 732–742 (2021).
pubmed: 30946084
doi: 10.1097/SLA.0000000000003283
Rumer, K. K. et al. Integrated single-cell and plasma proteomic modeling to predict surgical site complications: a prospective cohort study. Ann. Surg. 275, 582–590 (2022).
pubmed: 34954754
doi: 10.1097/SLA.0000000000005348
He, K. et al. A theoretical foundation of the target-decoy search strategy for false discovery rate control in proteomics. Preprint at https://doi.org/10.48550/arXiv.1501.00537 (2015).
He, K., Li, M.-J., Fu, Y., Gong, F.-Z. & Sun, X.-M. Null-free false discovery rate control using decoy permutations. Acta Math. Appl. Sin. 38, 235–253 (2022).
pubmed: 35431377
pmcid: 8994022
doi: 10.1007/s10255-022-1077-5
Weinstein, A., Su, W. J., Bogdan, M., Barber, R. F. & Candès, E. J. A power analysis for Model-X knockoffs with ℓ
Romano, Y., Sesia, M. & Candès, E. Deep knockoffs. J. Am. Stat. Assoc. 115, 1861–1872 (2019).
doi: 10.1080/01621459.2019.1660174
Chernozhukov, V. et al. Double/debiased machine learning for treatment and structural parameters. Econometrics J. 21, C1–C68 (2018).
doi: 10.1111/ectj.12097
Kursa, M. B. & Rudnicki, W. R. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010).
doi: 10.18637/jss.v036.i11
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
doi: 10.1023/A:1010933404324
Friedman, J. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 367–378 (2002).
doi: 10.1016/S0167-9473(01)00065-2
Candes, E. & Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35, 2313–2351 (2007).
Bickel, P. J., Ritov, Y. & Tsybakov, A. B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37, 1705–1732 (2009).
doi: 10.1214/08-AOS620
Bühlmann, P. & Van De Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications (Springer, 2011).
Zhao, P. & Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006).
Zhang, C.-H. & Huang, J. The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 36, 1567–1594 (2008).
doi: 10.1214/07-AOS520
Javanmard, A. & Montanari, A. Model selection for high-dimensional regression under the generalized irrepresentability condition. Proc. of the 26th International Conference on Neural Information Processing Systems 3012–3020 (Curran Associates, 2013).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Methodol. 57, 289–300 (1995).
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression. Ann. Stat. 32, 407–499 (2004).
doi: 10.1214/009053604000000067
Meinshausen, N. & Bühlmann, P. High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34, 1436–1462 (2006).
doi: 10.1214/009053606000000281
Celentano, M., Montanari, A. & Wei, Y. The Lasso with general Gaussian designs with applications to hypothesis testing. Preprint at https://doi.org/10.48550/arXiv.2007.13716 (2020).
Cario, M. C. & Nelson, B. L. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. http://www.ressources-actuarielles.net/EXT/ISFA/1226.nsf/769998e0a65ea348c1257052003eb94f/5d499a3efc8ae4dfc125756c00391ca6/$FILE/NORTA.pdf (1997).
Kurtz, Z. D. et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput. Biol. 11, e1004226 (2015).
pubmed: 25950956
pmcid: 4423992
doi: 10.1371/journal.pcbi.1004226
McGregor, K., Labbe, A. & Greenwood, C. M. MDiNE: a model to estimate differential co-occurrence networks in microbiome studies. Bioinformatics 36, 1840–1847 (2020).
pubmed: 31697315
doi: 10.1093/bioinformatics/btz824
Wang, Y. & Lê Cao, K.-A. PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. Brief. Bioinformatics 24, bbac622 (2023).
pubmed: 36653900
doi: 10.1093/bib/bbac622
American College of Obstetricians and Gynecologists. Gestational hypertension and preeclampsia: ACOG practice bulletin, number 222. Obstet. Gynecol. 135, e237–e260 (2020).
Assarsson, E. et al. Homogenous 96-plex PEA immunoassay exhibiting high sensitivity, specificity, and excellent scalability. PLoS ONE 9, e95192 (2014).
pubmed: 24755770
pmcid: 3995906
doi: 10.1371/journal.pone.0095192
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
pubmed: 24451623
pmcid: 3998144
doi: 10.1093/bioinformatics/btu033
Barbera, P. et al. EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol. 68, 365–369 (2019).
pubmed: 30165689
doi: 10.1093/sysbio/syy054
France, M. T. et al. VALENCIA: a nearest centroid classification method for vaginal microbial communities based on composition. Microbiome 8, 166 (2020).
pubmed: 33228810
pmcid: 7684964
doi: 10.1186/s40168-020-00934-6
Aitchison, J. The statistical analysis of compositional data. J. R. Stat. Soc. Series B Methodol. 44, 139–177 (1982).
Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
pubmed: 7063747
doi: 10.1148/radiology.143.1.7063747
Gold, L. et al. Aptamer-based multiplexed proteomic technology for biomarker discovery. Nat. Prec. https://doi.org/10.1038/npre.2010.4538.1 (2010).
Rohloff, J. C. et al. Nucleic acid ligands with protein-like side chains: modified aptamers and their use as diagnostic and therapeutic agents. Mol. Ther. Nucleic Acids 3, e201 (2014).
pubmed: 25291143
pmcid: 4217074
doi: 10.1038/mtna.2014.49