Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers.
Lung cancer
RNA-Seq
biomarkers
random forest classifier
supervised machine learning
Journal
Cancer informatics
ISSN: 1176-9351
Abbreviated title: Cancer Inform
Country: United States
NLM ID: 101258149
Publication information
Publication date: 2023
History:
received: 2023-01-04
accepted: 2023-03-17
medline: 2023-04-28
pubmed: 2023-04-28
entrez: 2023-04-28
Status:
epublish
Abstract
Lung cancer is considered the most common and the deadliest cancer type. It is mainly of 2 types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate the rare and novel transcripts that aid in discovering the genetic changes that occur in tumours in different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering the biomarkers remains a challenge. Classification models help uncover and classify the biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes and identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used in predicting the class. To address the imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms: Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier; additionally, 2 ensemble algorithms were considered: XGBoost and AdaBoost.
Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was considered the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict any further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be the possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be the possible biomarkers causing SCLC from the transcriptome analysis. The classifier gave a precision of 91.3% and a recall of 91% after fine-tuning. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
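The abstract names two steps without detail: Near Miss under-sampling and class-weighted metrics. The following is a minimal pure-Python sketch of both, not the authors' code. It assumes the NearMiss version 1 rule (keep the majority samples closest, on average, to their nearest minority samples) and support-weighted averaging of per-class precision and recall. The function names, the value of k and all data values are hypothetical; libraries such as imbalanced-learn and scikit-learn offer maintained implementations.

```python
from collections import Counter
from math import dist


def near_miss_1(X, y, k=3):
    """Shrink every class to the size of the rarest class (NearMiss-1 style).

    A sample from a larger class is kept when its mean distance to its
    k nearest samples of the rarest class is among the smallest.
    """
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_keep = counts[minority]
    minority_points = [x for x, label in zip(X, y) if label == minority]
    keep = [i for i, label in enumerate(y) if label == minority]
    for cls in counts:
        if cls == minority:
            continue
        scored = []
        for i, (x, label) in enumerate(zip(X, y)):
            if label == cls:
                nearest = sorted(dist(x, m) for m in minority_points)[:k]
                scored.append((sum(nearest) / len(nearest), i))
        keep.extend(i for _, i in sorted(scored)[:n_keep])
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def weighted_precision_recall(y_true, y_pred):
    """Average per-class precision and recall, weighted by class support."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / n
        precision += cls_precision * n / total
        recall += cls_recall * n / total
    return precision, recall


# Hypothetical (logFC, p-value) feature pairs; the values are illustrative only.
X = [(2.1, 0.01), (1.9, 0.02), (0.1, 0.80), (0.2, 0.70),
     (0.3, 0.90), (1.5, 0.05), (-0.1, 0.60), (3.0, 0.50)]
y = ["NSCLC", "NSCLC", "neither", "neither",
     "neither", "neither", "neither", "neither"]
X_bal, y_bal = near_miss_1(X, y, k=2)
print(Counter(y_bal))  # both classes now have 2 samples

# Hypothetical true and predicted labels for six genes.
y_true = ["NSCLC", "NSCLC", "SCLC", "neither", "neither", "neither"]
y_pred = ["NSCLC", "SCLC", "SCLC", "neither", "neither", "NSCLC"]
w_precision, w_recall = weighted_precision_recall(y_true, y_pred)
print(round(w_precision, 3), round(w_recall, 3))  # 0.75 0.667
```

With support weighting, weighted recall equals overall accuracy, which is one reason precision is usually reported alongside it on imbalanced data.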
Identifiers
pubmed: 37113644
doi: 10.1177/11769351231167992
pii: 10.1177_11769351231167992
pmc: PMC10126698
Publication types
Journal Article
Languages
eng
Pagination
11769351231167992
Copyright information
© The Author(s) 2023.
Conflict of interest statement
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.