An open-source framework for end-to-end analysis of electronic health record data.


Journal

Nature medicine
ISSN: 1546-170X
Abbreviated title: Nat Med
Country: United States
NLM ID: 9502015

Publication information

Publication date:
12 Sep 2024
History:
received: 11 Dec 2023
accepted: 25 Jul 2024
medline: 13 Sep 2024
pubmed: 13 Sep 2024
entrez: 12 Sep 2024
Status: ahead of print

Abstract

With progressive digitalization of healthcare systems worldwide, large-scale collection of electronic health records (EHRs) has become commonplace. However, an extensible framework for comprehensive exploratory analysis that accounts for data heterogeneity is missing. Here we introduce ehrapy, a modular open-source Python framework designed for exploratory analysis of heterogeneous epidemiology and EHR data. ehrapy incorporates a series of analytical steps, from data extraction and quality control to the generation of low-dimensional representations. Complemented by rich statistical modules, ehrapy facilitates associating patients with disease states, differential comparison between patient clusters, survival analysis, trajectory inference, causal inference and more. Leveraging ontologies, ehrapy further enables data sharing and training EHR deep learning models, paving the way for foundational models in biomedical research. We demonstrate ehrapy's features in six distinct examples. We applied ehrapy to stratify patients affected by unspecified pneumonia into finer-grained phenotypes. Furthermore, we reveal biomarkers for significant differences in survival among these groups. Additionally, we quantify medication-class effects of pneumonia medications on length of stay. We further leveraged ehrapy to analyze cardiovascular risks across different data modalities. We reconstructed disease state trajectories in patients with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) based on imaging data. Finally, we conducted a case study to demonstrate how ehrapy can detect and mitigate biases in EHR data. ehrapy, thus, provides a framework that we envision will standardize analysis pipelines on EHR data and serve as a cornerstone for the community.

Identifiers

pubmed: 39266748
doi: 10.1038/s41591-024-03214-0
pii: 10.1038/s41591-024-03214-0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© 2024. The Author(s).

Authors

Lukas Heumos (L)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Institute of Lung Health and Immunity and Comprehensive Pneumology Center with the CPC-M bioArchive; Helmholtz Zentrum Munich; member of the German Center for Lung Research (DZL), Munich, Germany.
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.

Philipp Ehmele (P)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.

Tim Treis (T)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.

Julius Upmeier zu Belzen (J)

Health Data Science Unit, Heidelberg University and BioQuant, Heidelberg, Germany.

Eljas Roellin (E)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.

Lilly May (L)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.

Altana Namsaraeva (A)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA), Darmstadt, Germany.

Nastassya Horlava (N)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.

Vladimir A Shitov (VA)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.

Xinyue Zhang (X)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.

Luke Zappia (L)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.

Rainer Knoll (R)

Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Bonn, Germany.

Niklas J Lang (NJ)

Institute of Lung Health and Immunity and Comprehensive Pneumology Center with the CPC-M bioArchive; Helmholtz Zentrum Munich; member of the German Center for Lung Research (DZL), Munich, Germany.

Leon Hetzel (L)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.

Isaac Virshup (I)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.

Lisa Sikkema (L)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.

Fabiola Curion (F)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany.
Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.

Roland Eils (R)

Health Data Science Unit, Heidelberg University and BioQuant, Heidelberg, Germany.
Center for Digital Health, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany.

Herbert B Schiller (HB)

Institute of Lung Health and Immunity and Comprehensive Pneumology Center with the CPC-M bioArchive; Helmholtz Zentrum Munich; member of the German Center for Lung Research (DZL), Munich, Germany.
Research Unit, Precision Regenerative Medicine (PRM), Helmholtz Munich, Munich, Germany.

Anne Hilgendorff (A)

Institute of Lung Health and Immunity and Comprehensive Pneumology Center with the CPC-M bioArchive; Helmholtz Zentrum Munich; member of the German Center for Lung Research (DZL), Munich, Germany.
Center for Comprehensive Developmental Care (CDeCLMU) at the Social Pediatric Center, Dr. von Hauner Children's Hospital, LMU Hospital, Ludwig Maximilian University, Munich, Germany.

Fabian J Theis (FJ)

Institute of Computational Biology, Helmholtz Munich, Munich, Germany. fabian.theis@helmholtz-muenchen.de.
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.
Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.

MeSH classifications