Machine Learning Applied to Clinical Laboratory Data in Spain for COVID-19 Outcome Prediction: Model Development and Validation.
Adolescent
Adult
Aged
Aged, 80 and over
COVID-19 / epidemiology
Child
Child, Preschool
Female
Humans
Infant
Infant, Newborn
Laboratories / standards
Machine Learning / standards
Male
Middle Aged
Pandemics
Prognosis
Reproducibility of Results
Research Design
Retrospective Studies
SARS-CoV-2 / isolation & purification
Spain / epidemiology
Treatment Outcome
Young Adult
COVID-19
electronic health record
machine learning
mortality
prediction
Journal
Journal of Medical Internet Research
ISSN: 1438-8871
Abbreviated title: J Med Internet Res
Country: Canada
NLM ID: 100959882
Publication information
Publication date: 14 04 2021
History:
received: 07 12 2020
accepted: 08 03 2021
revised: 29 12 2020
pubmed: 2 4 2021
medline: 29 4 2021
entrez: 1 4 2021
Status:
epublish
Abstract
Abstract sections
BACKGROUND
The COVID-19 pandemic is probably the greatest health catastrophe of the modern era. Spain's health care system has been exposed to uncontrollable numbers of patients over a short period, causing the system to collapse. Given that diagnosis is not immediate and there is no effective treatment for COVID-19, other tools have had to be developed to identify patients at risk of severe disease complications and thus optimize material and human resources in health care. There are no tools to identify patients who have a worse prognosis than others.
OBJECTIVE
This study aimed to process a sample of electronic health records of patients with COVID-19 in order to develop a machine learning model that predicts the severity of infection and mortality from clinical laboratory parameters. Early patient classification can help optimize material and human resources, and analysis of the most important features of the model could provide more detailed insights into the disease.
METHODS
After an initial performance evaluation based on a comparison with several other well-known methods, the extreme gradient boosting algorithm was selected as the predictive method for this study. In addition, Shapley Additive Explanations was used to analyze the importance of the features of the resulting model.
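To illustrate the feature-attribution idea behind Shapley Additive Explanations, the following minimal sketch computes exact Shapley values by enumerating every feature coalition for a toy risk score. Everything in it is an assumption made for illustration: the feature names, the baseline values, the coefficients of `toy_risk`, and the example patient are hypothetical, and the study itself applied the method to an extreme gradient boosting model rather than to a hand-written formula.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names and baseline ("reference patient") values.
FEATURES = ["ldh", "crp", "neutrophils", "urea"]
BASELINE = {"ldh": 250.0, "crp": 10.0, "neutrophils": 4.0, "urea": 30.0}


def toy_risk(x):
    """Made-up risk score with one interaction term; not the study's model."""
    score = 0.004 * x["ldh"] + 0.01 * x["crp"] + 0.05 * x["neutrophils"]
    score += 0.0002 * x["crp"] * x["urea"]
    return score


def value(coalition, patient):
    """Model output when only the features in `coalition` take the patient's
    values and all other features stay at their baseline values."""
    mixed = {f: (patient[f] if f in coalition else BASELINE[f]) for f in FEATURES}
    return toy_risk(mixed)


def shapley_values(patient):
    """Exact Shapley values: weighted average marginal contribution of each
    feature over all coalitions of the remaining features."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}, patient) - value(s, patient))
        phi[f] = total
    return phi


# Hypothetical patient with elevated laboratory values.
patient = {"ldh": 600.0, "crp": 150.0, "neutrophils": 9.0, "urea": 80.0}
phi = shapley_values(patient)
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:12s} {contribution:+.3f}")
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that makes per-feature rankings comparable. Exhaustive enumeration grows exponentially with the number of features, so it is only practical for small toy cases like this one.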
RESULTS
After data preprocessing, 1823 confirmed patients with COVID-19 and 32 predictor features were selected. On bootstrap validation, the extreme gradient boosting classifier yielded a value of 0.97 (95% CI 0.96-0.98) for the area under the receiver operating characteristic curve, 0.86 (95% CI 0.80-0.91) for the area under the precision-recall curve, 0.94 (95% CI 0.92-0.95) for accuracy, 0.77 (95% CI 0.72-0.83) for the F-score, 0.93 (95% CI 0.89-0.98) for sensitivity, and 0.91 (95% CI 0.86-0.96) for specificity. The 4 most relevant features for model prediction were lactate dehydrogenase activity, C-reactive protein levels, neutrophil counts, and urea levels.
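The abstract reports each metric with a 95% CI from bootstrap validation but does not describe the exact resampling scheme. The sketch below assumes the simplest variant, a percentile bootstrap over a fixed set of predictions, and may differ from what the authors did (for example, refitting the model on each resample). The cohort is synthetic, and the function names, the 0.5 decision threshold, and the number of resamples are all illustrative assumptions.

```python
import random


def auroc(y_true, scores):
    """AUROC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity(y_true, scores, threshold=0.5):
    """Fraction of positive cases whose score reaches the threshold."""
    flagged = [s >= threshold for y, s in zip(y_true, scores) if y == 1]
    return sum(flagged) / len(flagged)


def bootstrap_ci(y_true, scores, metric, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement, recompute
    the metric each time, and read off the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Synthetic cohort: about 20% positive cases with noisy, clipped risk scores.
rng = random.Random(42)
y_true = [1 if rng.random() < 0.2 else 0 for _ in range(200)]
scores = [min(1.0, max(0.0, rng.gauss(0.7 if y else 0.3, 0.15))) for y in y_true]

for name, metric in [("AUROC", auroc), ("sensitivity", sensitivity)]:
    point = metric(y_true, scores)
    low, high = bootstrap_ci(y_true, scores, metric)
    print(f"{name}: {point:.2f} (95% CI {low:.2f}-{high:.2f})")
```

Resampling only the predictions captures the sampling variability of the evaluation set; it does not capture variability from model fitting, which a scheme that retrains on each resample would also reflect.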
CONCLUSIONS
Our predictive model yielded excellent results in identifying patients who died of COVID-19, primarily from laboratory parameter values. Analysis of the resulting model identified a set of features with the most significant impact on the prediction, thus relating them to a higher risk of mortality.
Identifiers
pubmed: 33793407
pii: v23i4e26211
doi: 10.2196/26211
pmc: PMC8048712
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
e26211
Copyright information
©Juan L Domínguez-Olmedo, Álvaro Gragera-Martínez, Jacinto Mata, Victoria Pachón Álvarez. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 14.04.2021.