An open-source framework for end-to-end analysis of electronic health record data.
Journal
Nature medicine
ISSN: 1546-170X
Abbreviated title: Nat Med
Country: United States
NLM ID: 9502015
Publication information
Publication date:
12 Sep 2024
History:
received: 11 Dec 2023
accepted: 25 Jul 2024
medline: 13 Sep 2024
pubmed: 13 Sep 2024
entrez: 12 Sep 2024
Status:
aheadofprint
Abstract
With progressive digitalization of healthcare systems worldwide, large-scale collection of electronic health records (EHRs) has become commonplace. However, an extensible framework for comprehensive exploratory analysis that accounts for data heterogeneity is missing. Here we introduce ehrapy, a modular open-source Python framework designed for exploratory analysis of heterogeneous epidemiology and EHR data. ehrapy incorporates a series of analytical steps, from data extraction and quality control to the generation of low-dimensional representations. Complemented by rich statistical modules, ehrapy facilitates associating patients with disease states, differential comparison between patient clusters, survival analysis, trajectory inference, causal inference and more. Leveraging ontologies, ehrapy further enables data sharing and training EHR deep learning models, paving the way for foundational models in biomedical research. We demonstrate ehrapy's features in six distinct examples. We applied ehrapy to stratify patients affected by unspecified pneumonia into finer-grained phenotypes. Furthermore, we reveal biomarkers for significant differences in survival among these groups. Additionally, we quantify medication-class effects of pneumonia medications on length of stay. We further leveraged ehrapy to analyze cardiovascular risks across different data modalities. We reconstructed disease state trajectories in patients with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) based on imaging data. Finally, we conducted a case study to demonstrate how ehrapy can detect and mitigate biases in EHR data. ehrapy, thus, provides a framework that we envision will standardize analysis pipelines on EHR data and serve as a cornerstone for the community.
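The abstract lists survival analysis among the statistical modules. As a rough illustration of what such a step computes, here is a minimal standalone sketch of the Kaplan-Meier estimator in plain Python. It does not use or reproduce ehrapy's API; the function name `kaplan_meier` and the toy cohort are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if the event (e.g. death) was observed, 0 if censored
    """
    # Number of observed events at each time point
    deaths = Counter(t for t, e in zip(durations, events) if e)
    # Every patient leaves the risk set at their own follow-up time
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit update: S(t) = S(t-) * (1 - d / n_at_risk)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Remove both events and censored patients from the risk set
        at_risk -= exits[t]
    return curve


# Toy cohort with made-up numbers (not data from the paper)
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored patients (event flag 0) never lower the curve directly; they only shrink the risk set for later time points, which is what distinguishes this estimator from a naive fraction of patients still alive.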
Identifiers
pubmed: 39266748
doi: 10.1038/s41591-024-03214-0
pii: 10.1038/s41591-024-03214-0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Copyright information
© 2024. The Author(s).