A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data.

COVID-19; Humans; Logistic Models; Machine Learning; Pandemics; SARS-CoV-2

COVID-19; Data harmonization; Data quality control; Distributed data network; Machine learning; Risk prediction

Journal

Computer methods and programs in biomedicine

ISSN: 1872-7565

Abbreviated title: Comput Methods Programs Biomed

Country: Ireland

NLM ID: 8506513

Publication information

Publication date:
Nov 2021

History:

received: 2021-03-18

accepted: 2021-08-30

pubmed: 2021-09-25

medline: 2021-10-29

entrez: 2021-09-24

Status: ppublish

Abstract

BACKGROUND AND OBJECTIVE

As a response to the ongoing COVID-19 pandemic, several prediction models in the existing literature were rapidly developed, with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).

METHODS

We show step-by-step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?'. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA.

RESULTS

Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.

CONCLUSIONS

Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable rapid development towards reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers from all around the world.
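The abstract frames the prediction question as an outcome (death) within a time-at-risk window of 0 to 30 days after hospitalization, and reports performance in terms of discrimination and calibration. The following is a minimal Python sketch of those three ideas, not the OHDSI software tools' actual implementation. All function names and the five example patients are hypothetical and exist only for illustration; the calibration measure shown (observed-to-expected ratio) is one simple choice and is an assumption, since the abstract does not say which calibration measure was used.

```python
from datetime import date
from typing import Optional, Sequence


def died_in_time_at_risk(index_date: date, death_date: Optional[date],
                         start_day: int = 0, end_day: int = 30) -> int:
    """Return 1 if death falls within [start_day, end_day] days of index_date."""
    if death_date is None:
        return 0
    offset = (death_date - index_date).days
    return int(start_day <= offset <= end_day)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Discrimination: chance that a random case is scored above a random non-case.

    A tie between a case and a non-case counts as one half.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def observed_to_expected(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Calibration-in-the-large: observed event rate over mean predicted risk."""
    observed = sum(labels) / len(labels)
    expected = sum(scores) / len(scores)
    return observed / expected


# Hypothetical patients: (hospitalization date, death date or None, predicted risk).
patients = [
    (date(2020, 4, 1), date(2020, 4, 10), 0.80),  # death inside the window
    (date(2020, 4, 2), None, 0.10),               # no recorded death
    (date(2020, 4, 3), date(2020, 6, 1), 0.30),   # death after the window
    (date(2020, 4, 5), date(2020, 5, 5), 0.25),   # death on day 30 (inclusive)
    (date(2020, 4, 6), None, 0.20),               # no recorded death
]

labels = [died_in_time_at_risk(index, death) for index, death, _ in patients]
scores = [risk for _, _, risk in patients]

print("labels:", labels)
print("AUROC:", round(auroc(labels, scores), 3))
print("observed/expected:", round(observed_to_expected(labels, scores), 3))
```

Running the same two metric functions on a held-out split of the development database and again on each external database mirrors the internal versus external validation comparison described above. An observed-to-expected ratio near 1 suggests the average predicted risk matches the observed death rate.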

Identifiers

DOI: 10.1016/j.cmpb.2021.106394

PMID: 34560604

PMC: PMC8420135

pii: S0169-2607(21)00468-5

Publication types

Journal Article

Languages

eng

Pagination

106394

Conflict of interest statement

Declaration of Competing Interest: CB, MJS, AGS, and JMR are employees of Janssen Research & Development and shareholders of Johnson & Johnson.

Authors

Sara Khalid (S)

Cynthia Yang (C)

Clair Blacketer (C)

Talita Duarte-Salles (T)

Sergio Fernández-Bertolín (S)

Chungsoo Kim (C)

Rae Woong Park (RW)

Jimyung Park (J)

Martijn J Schuemie (MJ)

Anthony G Sena (AG)

Marc A Suchard (MA)

Seng Chan You (SC)

Peter R Rijnbeek (PR)

Jenna M Reps (JM)
