Predicting COVID-19 mortality risk in Toronto, Canada: a comparison of tree-based and regression-based machine learning methods.

COVID-19; Humans; Logistic Models; Machine Learning; ROC Curve; SARS-CoV-2

COVID-19 mortality; Classification trees; Extreme gradient boosting; Generalized additive model; Predictive model

Journal

BMC Medical Research Methodology

ISSN: 1471-2288

Abbreviated title: BMC Med Res Methodol

Country: England

NLM ID: 100968545

Publication information

Publication date:
27 November 2021

History:

received: 23 April 2021

accepted: 14 October 2021

entrez: 28 November 2021

pubmed: 29 November 2021

medline: 15 December 2021

Status: epublish

Abstract

Coronavirus disease (COVID-19) presents an unprecedented threat to global health. Accurately predicting mortality risk among infected individuals is crucial for prioritizing medical care and reducing the burden on the healthcare system. The present study aimed to assess how accurately machine learning methods predict COVID-19 mortality risk. We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) in predicting mortality risk among 49,216 COVID-19 positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated on training samples, and their predictive accuracy on the testing samples was assessed using the area under the receiver operating characteristic curve (AUC), the Brier score, the calibration intercept and the calibration slope. XGBoost was highly discriminative, with an AUC of 0.9669, and outperformed the conventional tree-based methods (classification tree and RF) in predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) performed comparably to XGBoost, with slightly lower AUCs and higher Brier scores. XGBoost offers superior performance over conventional tree-based methods and a minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.
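The four accuracy measures named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code: the function names, the toy outcome vector and the toy predicted risks are all hypothetical. The abstract also does not say how the calibration intercept and slope were estimated, so the sketch assumes one common approach, fitting a logistic regression of the observed outcome on the logit of the predicted risk.

```python
import math

def auc(y_true, y_prob):
    """AUC as the share of (death, survivor) pairs in which the death
    received the higher predicted risk; ties count as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def logit(p, eps=1e-12):
    p = min(max(p, eps), 1 - eps)  # guard against log(0)
    return math.log(p / (1 - p))

def calibration(y_true, y_prob, iters=50, tol=1e-10):
    """Assumed approach: fit y ~ a + b * logit(p) by Newton-Raphson.
    Returns (a, b); a = 0 and b = 1 indicate ideal calibration."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

# Hypothetical toy data: 1 = died, 0 = survived, with predicted risks.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print("AUC:", auc(y, p))
print("Brier score:", brier_score(y, p))
print("Calibration (intercept, slope):", calibration(y, p))
```

A slope below 1 suggests predictions that are too extreme, and a slope above 1 suggests predictions that are too conservative. In practice the same quantities would be computed on each held-out testing sample and averaged across the repeated splits.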

Identifiers

DOI: 10.1186/s12874-021-01441-4

PMID: 34837951

PMC: PMC8627169

pii: 10.1186/s12874-021-01441-4

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

English

Pagination

267

Authors

Cindy Feng (C)

George Kephart (G)

Elizabeth Juarez-Colunga (E)
