Union with Recursive Feature Elimination: A feature selection framework to improve the classification performance of multi-category causes of death in colorectal cancer.
Colorectal Cancer
Feature Selection
Machine Learning
Multi-category Death Causes
U-RFE
Journal
Laboratory investigation; a journal of technical methods and pathology
ISSN: 1530-0307
Abbreviated title: Lab Invest
Country: United States
NLM ID: 0376617
Publication information
Publication date:
27 Dec 2023
History:
received: 26 Mar 2023
revised: 05 Dec 2023
accepted: 20 Dec 2023
medline: 2 Jan 2024
pubmed: 2 Jan 2024
entrez: 29 Dec 2023
Status:
aheadofprint
Abstract
Despite the use of machine learning tools, it remains challenging to model cause-specific deaths in colorectal cancer (CRC) patients properly and to choose appropriate treatments. Here, we propose a feature selection framework, union with recursive feature elimination (U-RFE), to select the union feature sets that are crucial in CRC progression-specific mortality using the TCGA dataset. Based on the union feature sets, the performance of 5 classification algorithms, namely logistic regression (LR), support vector machines (SVM), random forest (RF), eXtreme gradient boosting (XGBoost) and Stacking, was compared to identify the best model for classifying 4-category deaths. In the first stage of U-RFE, LR, SVM and RF were used as base estimators to obtain subsets containing the same number of features but not exactly the same features. Union analysis of the subsets was then performed to determine the final union feature set, which can effectively combine the advantages of different algorithms. We found that the U-RFE framework could improve the performance of various models. Stacking outperformed LR, SVM, RF and XGBoost in most scenarios. When the target feature number of the RFE was set to 50 and the union feature set contained 298 deterministic features, the Stacking model achieved F1_weighted, Recall_weighted, Precision_weighted, Accuracy and Matthews Correlation Coefficient of 0.851, 0.864, 0.854, 0.864 and 0.717, respectively. The performance on the minority categories was also significantly improved. Therefore, this recursive-feature-elimination-based approach to feature selection improves the performance of classifying CRC deaths using clinical and 'omic data, and of classification on other data with high feature redundancy and imbalance.
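The first stage described in the abstract (one RFE run per base estimator, followed by a union of the selected subsets) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it assumes scikit-learn, uses synthetic 4-class data in place of the TCGA dataset, and the function name `union_rfe`, the hyperparameters, and the target of 10 features are hypothetical choices. It does not attempt to reproduce the 298-feature set or the scores reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def union_rfe(X, y, n_features_to_select, step=1):
    """Run RFE once per base estimator; return (union indices, per-estimator subsets).

    Hypothetical helper: the estimator settings below are assumptions,
    not values taken from the paper.
    """
    base_estimators = {
        "LR": LogisticRegression(max_iter=2000),
        "SVM": SVC(kernel="linear"),  # linear kernel so RFE can read coef_
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    subsets = {}
    for name, estimator in base_estimators.items():
        selector = RFE(estimator, n_features_to_select=n_features_to_select, step=step)
        selector.fit(X, y)
        # support_ is a boolean mask over the original feature columns
        subsets[name] = set(np.flatnonzero(selector.support_).tolist())
    # Union of the equally sized subsets selected by the three estimators
    union = sorted(set().union(*subsets.values()))
    return union, subsets


# Synthetic 4-class stand-in data with redundant features (assumption)
X, y = make_classification(
    n_samples=200, n_features=40, n_informative=8, n_redundant=10,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
union, subsets = union_rfe(X, y, n_features_to_select=10, step=5)
print({name: len(s) for name, s in subsets.items()}, "union size:", len(union))
```

Each subset has the same size, while the union size depends on how much the three estimators agree. The columns `X[:, union]` could then be passed to any downstream classifier, for example a stacking ensemble.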
Identifiers
pubmed: 38158124
pii: S0023-6837(23)00263-5
doi: 10.1016/j.labinv.2023.100320
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
100320
Copyright information
Copyright © 2023. Published by Elsevier Inc.