Machine learning in rare disease.


Journal

Nature methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604

Publication information

Publication date:
Jun 2023
History:
received: 16 03 2021
accepted: 22 04 2023
medline: 12 6 2023
pubmed: 30 5 2023
entrez: 29 5 2023
Status: ppublish

Abstract

High-throughput profiling methods (such as genomics or imaging) have accelerated basic research and made deep molecular characterization of patient samples routine. These approaches provide a rich portrait of genes, molecular pathways and cell types involved in disease phenotypes. Machine learning (ML) can be a useful tool for extracting disease-relevant patterns from high-dimensional datasets. However, depending upon the complexity of the biological question, machine learning often requires many samples to identify recurrent and biologically meaningful patterns. Rare diseases are inherently limited in clinical cases, leading to few samples to study. In this Perspective, we outline the challenges and emerging solutions for using ML for small sample sets, specifically in rare diseases. Advances in ML methods for rare diseases are likely to be informative for applications beyond rare diseases for which few samples exist with high-dimensional data. We propose that the methods community prioritize the development of ML techniques for rare disease research.

Identifiers

pubmed: 37248386
doi: 10.1038/s41592-023-01886-z
pii: 10.1038/s41592-023-01886-z

Publication types

Journal Article; Review

Languages

eng

Citation subsets

IM

Pagination

803-814

Grants

Agency: NHGRI NIH HHS
ID: R01 HG010067
Country: United States

Copyright information

© 2023. Springer Nature America, Inc.

Authors

Jineta Banerjee (J)

Sage Bionetworks, Seattle, WA, USA.

Jaclyn N Taroni (JN)

Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA.

Robert J Allaway (RJ)

Sage Bionetworks, Seattle, WA, USA.

Deepashree Venkatesh Prasad (DV)

Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA.

Justin Guinney (J)

Sage Bionetworks, Seattle, WA, USA.

Casey Greene (C)

Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA. casey.s.greene@cuanschutz.edu.
