A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data.
COVID-19
Data harmonization
Data quality control
Distributed data network
Machine learning
Risk prediction
Journal
Computer methods and programs in biomedicine
ISSN: 1872-7565
Abbreviated title: Comput Methods Programs Biomed
Country: Ireland
NLM ID: 8506513
Publication information
Publication date: Nov 2021
History:
received: 2021-03-18
accepted: 2021-08-30
pubmed: 2021-09-25
medline: 2021-10-29
entrez: 2021-09-24
Status: ppublish
Abstract
BACKGROUND AND OBJECTIVE
In response to the ongoing COVID-19 pandemic, several prediction models were rapidly developed and published with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models has been found to be reliable. Models are commonly assessed to be at risk of bias, often because of insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).
METHODS
We show step by step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?' We develop models using six different machine learning methods in a US claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA.
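The prediction question above fixes a target population (patients hospitalized with COVID-19), an outcome (death), and a time-at-risk window (0 to 30 days after hospitalization). As a minimal sketch of how such a window turns into a binary outcome label, the following Python example uses a hypothetical `Hospitalization` record and a hypothetical helper `died_in_time_at_risk`; neither is part of the OHDSI tools, and the dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Hospitalization:
    """Hypothetical record for one member of the target population."""
    person_id: int
    admission_date: date          # index date: start of the hospitalization
    death_date: Optional[date]    # None when no death is recorded


def died_in_time_at_risk(h: Hospitalization,
                         start_day: int = 0,
                         end_day: int = 30) -> bool:
    """Return True if death falls within [start_day, end_day] days of admission."""
    if h.death_date is None:
        return False
    days_after_admission = (h.death_date - h.admission_date).days
    return start_day <= days_after_admission <= end_day


# Illustrative (made-up) records, not data from the study.
cohort = [
    Hospitalization(1, date(2020, 4, 1), date(2020, 4, 20)),  # day 19: inside window
    Hospitalization(2, date(2020, 4, 1), date(2020, 5, 2)),   # day 31: outside window
    Hospitalization(3, date(2020, 4, 1), None),               # no recorded death
]
labels = [int(died_in_time_at_risk(h)) for h in cohort]
```

Making the window bounds explicit parameters keeps the definition of the outcome visible and easy to vary, which is the kind of specification the question above requires.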
RESULTS
Our open-source software tools enabled us to go efficiently end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared with L1-regularized logistic regression, whereas the multilayer perceptron (MLP) neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.
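The results above are stated in terms of discrimination and calibration. As a rough illustration of what these two properties measure, the sketch below computes the area under the ROC curve (a common discrimination measure) and a simple observed-versus-expected comparison (one basic calibration check) in plain Python. The function names and the toy numbers are assumptions for illustration only; the abstract does not state which exact metrics or implementation the authors used.

```python
from typing import Sequence, Tuple


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive case scores higher than a random
    negative case; tied scores count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def observed_vs_expected(y_true: Sequence[int],
                         y_prob: Sequence[float]) -> Tuple[float, float]:
    """Return (observed outcome rate, mean predicted risk).
    Similar values suggest the model is calibrated on average."""
    observed = sum(y_true) / len(y_true)
    expected = sum(y_prob) / len(y_prob)
    return observed, expected


# Toy outcomes and predicted risks, made up for illustration.
y = [0, 0, 1, 1]
risk = [0.1, 0.4, 0.35, 0.8]

discrimination = auc(y, risk)
observed, expected = observed_vs_expected(y, risk)
```

In an internal validation these functions would be applied to held-out data from the development database, and in an external validation to predictions made on each of the other databases.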
CONCLUSION
Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable rapid progress towards reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers around the world.
Identifiers
pubmed: 34560604
pii: S0169-2607(21)00468-5
doi: 10.1016/j.cmpb.2021.106394
pmc: PMC8420135
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
106394
Copyright information
Copyright © 2021. Published by Elsevier B.V.
Conflict of interest statement
Declaration of Competing Interest: CB, MJS, AGS, and JMR are employees of Janssen Research & Development and shareholders of Johnson & Johnson.