Machine Learning Applied to Clinical Laboratory Data in Spain for COVID-19 Outcome Prediction: Model Development and Validation.


Journal

Journal of medical Internet research
ISSN: 1438-8871
Abbreviated title: J Med Internet Res
Country: Canada
NLM ID: 100959882

Publication information

Publication date:
14 04 2021
History:
received: 07 12 2020
accepted: 08 03 2021
revised: 29 12 2020
pubmed: 2 4 2021
medline: 29 4 2021
entrez: 1 4 2021
Status: epublish

Abstract sections

BACKGROUND
The COVID-19 pandemic is probably the greatest health catastrophe of the modern era. Spain's health care system has been exposed to uncontrollable numbers of patients over a short period, causing the system to collapse. Given that diagnosis is not immediate and there is no effective treatment for COVID-19, other tools have had to be developed to identify patients at risk of severe disease complications and thus optimize material and human resources in health care. There are no tools to identify patients who have a worse prognosis than others.
OBJECTIVE
This study aimed to process a sample of electronic health records of patients with COVID-19 in order to develop a machine learning model that predicts the severity of infection and mortality from clinical laboratory parameters. Early patient classification can help optimize material and human resources, and analysis of the most important features of the model could provide more detailed insights into the disease.
METHODS
After an initial performance evaluation based on a comparison with several other well-known methods, the extreme gradient boosting algorithm was selected as the predictive method for this study. In addition, Shapley Additive Explanations was used to analyze the importance of the features of the resulting model.
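To make the Shapley Additive Explanations idea concrete, the following is a minimal sketch and not the authors' code. It computes exact Shapley values for a hypothetical three-feature scoring function by averaging each feature's marginal contribution over all feature orderings. The function `toy_risk`, the feature names, the patient values, and the baseline values are all illustrative assumptions; for gradient-boosted trees, the SHAP library relies on much more efficient tree-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations

# Hypothetical toy "model": NOT the study's classifier, only a small
# nonlinear scoring function over three illustrative laboratory features.
def toy_risk(x):
    score = 0.002 * x["ldh"] + 0.01 * x["crp"]
    # Interaction term, so contributions are not purely additive.
    if x["ldh"] > 400 and x["urea"] > 50:
        score += 0.5
    return score

def shapley_values(model, instance, baseline):
    """Exact Shapley values: the average marginal contribution of each
    feature over all orderings, with absent features set to the baseline."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)        # start with every feature "absent"
        previous = model(current)
        for f in order:
            current[f] = instance[f]    # switch feature f to its real value
            value = model(current)
            totals[f] += value - previous
            previous = value
    return {f: totals[f] / len(orderings) for f in features}

# Assumed example values (illustrative only, not taken from the study data).
patient = {"ldh": 600, "crp": 120, "urea": 80}
baseline = {"ldh": 250, "crp": 5, "urea": 30}

phi = shapley_values(toy_risk, patient, baseline)
# Efficiency property: the contributions sum to f(patient) - f(baseline).
gap = toy_risk(patient) - toy_risk(baseline)
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.3f}")
```

In this toy case the interaction bonus is split equally between the two features that jointly trigger it, which is the kind of attribution a feature-importance ranking builds on.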
RESULTS
After data preprocessing, 1823 confirmed patients with COVID-19 and 32 predictor features were selected. On bootstrap validation, the extreme gradient boosting classifier yielded a value of 0.97 (95% CI 0.96-0.98) for the area under the receiver operating characteristic curve, 0.86 (95% CI 0.80-0.91) for the area under the precision-recall curve, 0.94 (95% CI 0.92-0.95) for accuracy, 0.77 (95% CI 0.72-0.83) for the F-score, 0.93 (95% CI 0.89-0.98) for sensitivity, and 0.91 (95% CI 0.86-0.96) for specificity. The 4 most relevant features for model prediction were lactate dehydrogenase activity, C-reactive protein levels, neutrophil counts, and urea levels.
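As a rough illustration of how threshold metrics and their 95% CIs can be obtained, the sketch below computes accuracy, sensitivity, specificity, and the F-score from binary labels and attaches a percentile bootstrap interval. It is a simplification under stated assumptions: the labels and predictions are synthetic, the helper names `metrics` and `bootstrap_ci` are hypothetical, and only fixed predictions are resampled, whereas the bootstrap validation described above may well refit the model on each resample.

```python
import random

def metrics(y_true, y_pred):
    """Threshold metrics from binary labels (1 = died, 0 = survived)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "f_score": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else nan,
    }

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement and
    recompute the metric on each resample."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metrics([y_true[i] for i in idx],
                        [y_pred[i] for i in idx])[metric]
        if value == value:              # skip NaN from degenerate resamples
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper

# Synthetic, imbalanced example (illustrative only, not the study data):
# 20 deaths (18 detected) and 80 survivors (8 false alarms).
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 18 + [0] * 2 + [1] * 8 + [0] * 72

point = metrics(y_true, y_pred)
acc_lo, acc_hi = bootstrap_ci(y_true, y_pred, "accuracy")
print(point)
print(f"accuracy 95% CI: {acc_lo:.2f}-{acc_hi:.2f}")
```

With imbalanced classes such as these, the F-score can sit well below sensitivity and specificity because it depends on precision, which false positives among the larger negative class pull down.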
CONCLUSIONS
Our predictive model yielded excellent results in differentiating patients who died of COVID-19, based primarily on laboratory parameter values. Analysis of the resulting model identified a set of features with the most significant impact on the prediction, thus relating them to a higher risk of mortality.

Identifiers

pubmed: 33793407
pii: v23i4e26211
doi: 10.2196/26211
pmc: PMC8048712

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

e26211

Copyright information

©Juan L Domínguez-Olmedo, Álvaro Gragera-Martínez, Jacinto Mata, Victoria Pachón Álvarez. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 14.04.2021.

Authors

Juan L Domínguez-Olmedo (JL)

Higher Technical School of Engineering, University of Huelva, Huelva, Spain.

Álvaro Gragera-Martínez (Á)

Juan Ramón Jiménez University Hospital, Huelva, Spain.

Jacinto Mata (J)

Higher Technical School of Engineering, University of Huelva, Huelva, Spain.

Victoria Pachón Álvarez (V)

Higher Technical School of Engineering, University of Huelva, Huelva, Spain.
