A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data.

COVID-19; Data harmonization; Data quality control; Distributed data network; Machine learning; Risk prediction

Journal

Computer methods and programs in biomedicine
ISSN: 1872-7565
Abbreviated title: Comput Methods Programs Biomed
Country: Ireland
NLM ID: 8506513

Publication information

Publication date:
Nov 2021
History:
received: 18 03 2021
accepted: 30 08 2021
pubmed: 25 9 2021
medline: 29 10 2021
entrez: 24 9 2021
Status: ppublish

Abstract

BACKGROUND AND OBJECTIVE
As a response to the ongoing COVID-19 pandemic, several prediction models in the existing literature were rapidly developed, with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).
METHODS
We show step-by-step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?'. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA.
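The prediction question above has three parts: a target population (patients hospitalized with COVID-19), an outcome (death), and a time-at-risk window (0 to 30 days after hospitalization). The following is a minimal Python sketch of how such a window could turn records into binary labels. It is not the OHDSI implementation: the `Hospitalization` record layout, the `label_outcome` helper, and the toy dates are hypothetical, and the window is assumed to include both day 0 and day 30. Loss to follow-up is ignored here for brevity.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Hospitalization:
    """One target-cohort entry (hypothetical record layout)."""
    patient_id: int
    admission_date: date        # index date: start of the COVID-19 hospitalization
    death_date: Optional[date]  # None when no death is recorded


def label_outcome(h: Hospitalization,
                  risk_start_days: int = 0,
                  risk_end_days: int = 30) -> int:
    """Return 1 if death falls inside the time-at-risk window, else 0.

    Assumption: both window bounds are inclusive.
    """
    if h.death_date is None:
        return 0
    window_start = h.admission_date + timedelta(days=risk_start_days)
    window_end = h.admission_date + timedelta(days=risk_end_days)
    return int(window_start <= h.death_date <= window_end)


# Toy cohort, invented purely for illustration.
cohort = [
    Hospitalization(1, date(2020, 4, 1), date(2020, 4, 10)),  # death on day 9
    Hospitalization(2, date(2020, 4, 1), None),               # no recorded death
    Hospitalization(3, date(2020, 4, 1), date(2020, 6, 1)),   # death on day 61
]
labels = [label_outcome(h) for h in cohort]
print(labels)  # [1, 0, 0]
```

Stating the window explicitly as parameters makes the same labeling rule reusable when the models are later applied to other databases.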
RESULTS
Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.
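Discrimination and calibration are the two properties compared above. As a rough illustration, the sketch below computes one common measure of each in plain Python: the area under the ROC curve (the probability that a randomly chosen patient who died received a higher predicted risk than one who survived) and calibration-in-the-large (mean observed outcome rate minus mean predicted risk). The abstract does not state which metrics were used, so both choices are assumptions, and the function names and toy numbers are invented for illustration.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(y_true: Sequence[int],
                             y_score: Sequence[float]) -> float:
    """Mean observed outcome rate minus mean predicted risk (0 is ideal)."""
    return sum(y_true) / len(y_true) - sum(y_score) / len(y_score)


# Toy predictions for an "internal" and an "external" data set (invented).
internal_y, internal_p = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]
external_y, external_p = [1, 0, 1, 0, 0], [0.9, 0.2, 0.6, 0.7, 0.1]

for name, y, p in [("internal", internal_y, internal_p),
                   ("external", external_y, external_p)]:
    print(name,
          round(auc(y, p), 3),
          round(calibration_in_the_large(y, p), 3))
```

Evaluating the same two functions on internal and external data keeps the comparison like-for-like; a drop in AUC or a calibration value far from zero on the external set would suggest the model transports poorly.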
CONCLUSIONS
Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable rapid development toward reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers around the world.

Identifiers

pubmed: 34560604
pii: S0169-2607(21)00468-5
doi: 10.1016/j.cmpb.2021.106394
pmc: PMC8420135

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

106394

Copyright information

Copyright © 2021. Published by Elsevier B.V.

Conflict of interest statement

Declaration of Competing Interest: CB, MJS, AGS, and JMR are employees of Janssen Research & Development and shareholders of Johnson & Johnson.

Authors

Sara Khalid (S)

Botnar Research Centre, Centre for Statistics in Medicine, Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, UK.

Cynthia Yang (C)

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.

Clair Blacketer (C)

Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.

Talita Duarte-Salles (T)

Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain.

Sergio Fernández-Bertolín (S)

Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain.

Chungsoo Kim (C)

Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea.

Rae Woong Park (RW)

Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea; Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea.

Jimyung Park (J)

Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea.

Martijn J Schuemie (MJ)

Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.

Anthony G Sena (AG)

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands; Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.

Marc A Suchard (MA)

Departments of Biomathematics, University of California, Los Angeles, USA.

Seng Chan You (SC)

Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Republic of Korea.

Peter R Rijnbeek (PR)

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.

Jenna M Reps (JM)

Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA. Electronic address: jreps@its.jnj.com.
