Machine learning-enabled risk prediction of chronic obstructive pulmonary disease with unbalanced data.
Chronic obstructive pulmonary disease
Disease risk prediction
Imbalanced data
Machine learning
Journal
Computer methods and programs in biomedicine
ISSN: 1872-7565
Abbreviated title: Comput Methods Programs Biomed
Country: Ireland
NLM ID: 8506513
Publication information
Publication date:
Mar 2023
History:
received: 09 Apr 2022
revised: 25 Nov 2022
accepted: 04 Jan 2023
pubmed: 15 Jan 2023
medline: 22 Feb 2023
entrez: 14 Jan 2023
Status:
ppublish
Abstract
BACKGROUND AND OBJECTIVE
Because the early symptoms of chronic obstructive pulmonary disease (COPD) are not obvious, patients are not easily identified, and the best window for prevention and treatment is often missed. In the present study, machine learning (ML) methods were employed to construct a risk prediction model for COPD and to improve the efficiency of its prediction.
METHODS
We collected data from a sample of 5807 cases with a complete COPD diagnosis from the 2019 COPD Surveillance Program in Shanxi Province and extracted 34 potentially relevant variables from the dataset. First, we used three feature selection methods (generalized elastic net, lasso and adaptive lasso) to select ten variables. Next, to handle the class imbalance, we combined cost-sensitive learning and SMOTE resampling, in turn, with each of the ML methods (Logistic Regression, SVM, Random Forest, XGBoost, LightGBM, NGBoost and Stacking). Finally, we assessed their performance.
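As a rough illustration of the two imbalance-handling strategies named above, the sketch below computes inverse-frequency class weights (one common form of cost-sensitive learning, the same heuristic scikit-learn calls "balanced") and generates synthetic minority samples by SMOTE-style interpolation. It is a minimal pure-Python sketch, not the authors' implementation: the function names `balanced_class_weights` and `smote_oversample`, the toy data and the value of `k` are all assumptions made for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get larger weights, so errors on them cost more.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Order the other minority samples by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 9 negatives and 3 positives with two features each.
labels = [0] * 9 + [1] * 3
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]

weights = balanced_class_weights(labels)           # class 1 is weighted 3x class 0
synthetic = smote_oversample(minority, n_new=6, k=2)
```

Class weighting leaves the data untouched and changes only the loss, whereas SMOTE changes the training set itself; in either case the adjustment would normally be applied to the training split only, so that evaluation still reflects the original class ratio.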
RESULTS
Frequent cough at or before age 14 and nine other variables were significant predictors of COPD. The heterogeneous Stacking ensemble model performed relatively well on the unbalanced data. After imbalance handling, Logistic Regression with class weighting achieved the best classification performance when the composite indicators AUC, F1-Score and G-mean were used as criteria for model comparison. The F1-Score/G-mean values of the top three ML models were 0.290/0.660 for Logistic Regression with class weighting, 0.288/0.649 for Stacking with the synthetic minority oversampling technique (SMOTE), and 0.285/0.648 for LightGBM with SMOTE.
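For reference, F1-Score and G-mean can both be derived from a binary confusion matrix. The sketch below assumes that G-mean is the geometric mean of sensitivity and specificity, which is the usual definition but may not match the paper's exact formula. The function name `f1_and_gmean` is hypothetical and the counts are invented for illustration; they are not the study's results.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1-Score, G-mean) from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Invented counts: 100 positives and 900 negatives.
f1, gmean = f1_and_gmean(tp=60, fp=240, fn=40, tn=660)
```

With a rare positive class, precision can stay low even when sensitivity and specificity are both moderate, so an F1-Score near 0.3 alongside a G-mean above 0.6 is not contradictory.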
CONCLUSIONS
By combining feature selection, imbalanced-data handling and machine learning methods with data from disease surveillance questionnaires and physical measurements to identify people at risk of COPD, this study concludes that machine learning models based on survey questionnaires could automatically identify patients at risk of COPD and provide a simple, scientific aid for its early identification.
Identifiers
pubmed: 36640604
pii: S0169-2607(23)00007-X
doi: 10.1016/j.cmpb.2023.107340
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
107340
Copyright information
Copyright © 2023 The Author(s). Published by Elsevier B.V. All rights reserved.
Conflict of interest statement
Declaration of Competing Interest: We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled. Both authors declare no support from any organization for the submitted work, no financial relationships with any organizations that might have an interest in the submitted work in the previous three years, and no other relationships or activities that could appear to have influenced the submitted work.