ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest.
Auto Encoder
Cancer subtyping
Gene expression data
Machine learning
Random forest
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
19 Jul 2023
History:
received: 15 Nov 2022
accepted: 13 Jul 2023
medline: 21 Jul 2023
pubmed: 20 Jul 2023
entrez: 19 Jul 2023
Status:
epublish
Abstract
BACKGROUND
Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes from high-dimensional gene expression data, satisfactory classification results remain difficult to obtain. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. This prior knowledge is therefore significant for identifying further meaningful subtypes.
RESULTS
In this paper, we present ForestSubtype, an approach that combines parallel random forests (RF) and an autoencoder to identify cancer subtypes from high-dimensional gene expression data. First, ForestSubtype uses the parallel RF and prior knowledge of cancer subtypes to train a model and extract significant candidate features. Second, ForestSubtype uses a random forest as the base model and runs ten parallel random forests, each of which computes a weight for every feature and ranks the features separately. The intersection of the highest-weighted features output by the ten parallel random forests is then taken as the set of candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype applies k-means++ to obtain the new cancer subtype identification results. Breast cancer gene expression data from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Two additional cancer datasets are used to validate the generalizability of ForestSubtype. ForestSubtype outperforms two other methods in terms of cluster distribution and of internal and external metrics. The open-source code is available at https://github.com/lffyd/ForestSubtype.
CONCLUSIONS
Our work shows that combining high-dimensional gene expression data with parallel random forests and an autoencoder, guided by prior knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
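The four steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that); it assumes scikit-learn, synthetic data in place of gene expression profiles, a one-hidden-layer `MLPRegressor` standing in for the autoencoder, and hypothetical names and settings (`select_features`, `encode_2d`, `top_k=10`, three clusters) chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler


def select_features(X, y, n_forests=10, top_k=10):
    """Keep the features ranked in the top_k by every one of n_forests forests."""
    selected = None
    for seed in range(n_forests):
        # Each forest differs only by its random seed (an assumption).
        forest = RandomForestClassifier(
            n_estimators=100, random_state=seed, n_jobs=-1
        )
        forest.fit(X, y)  # y plays the role of the known subtype labels
        top = set(np.argsort(forest.feature_importances_)[-top_k:].tolist())
        selected = top if selected is None else selected & top
    return sorted(selected)


def encode_2d(X, seed=0):
    """Train a small autoencoder (X -> 2 units -> X) and return the 2-D code."""
    X = StandardScaler().fit_transform(X)
    autoencoder = MLPRegressor(
        hidden_layer_sizes=(2,), activation="tanh",
        max_iter=2000, random_state=seed,
    )
    autoencoder.fit(X, X)  # reconstruct the input from the bottleneck
    # Forward pass through the encoder half only.
    return np.tanh(X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])


# Synthetic stand-in for expression data: 200 samples, 50 "genes",
# of which the first 5 carry the class signal (shuffle=False).
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0,
    shuffle=False, random_state=0,
)

selected = select_features(X, y)
if not selected:
    raise ValueError("no feature was ranked in the top_k by every forest")

embedding = encode_2d(X[:, selected])
labels = KMeans(
    n_clusters=3, init="k-means++", n_init=10, random_state=0
).fit_predict(embedding)

print("selected feature indices:", selected)
print("cluster sizes:", np.bincount(labels))
```

Taking the intersection rather than the union keeps only features that every forest agrees on, which trades recall for stability; the number of clusters passed to k-means++ is a free parameter here and would need to be chosen for real data.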
Identifiers
pubmed: 37468832
doi: 10.1186/s12859-023-05412-y
pii: 10.1186/s12859-023-05412-y
pmc: PMC10354904
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
289
Grants
Agency: National Natural Science Foundation of China
ID: 61972134
Agency: Young Elite Teachers in Henan Province
ID: 2020GGJS050
Agency: Innovative and Scientific Research Team of Henan Polytechnic University
ID: T2021-3
Agency: Doctor Foundation of Henan Polytechnic University
ID: B2018-36
Copyright information
© 2023. The Author(s).