Machine Learning Methods for Survival Analysis with Clinical and Transcriptomics Data of Breast Cancer.
Breast cancer
Data integration
Interpretability
Machine learning
Survival analysis
Journal
Methods in molecular biology (Clifton, N.J.)
ISSN: 1940-6029
Abbreviated title: Methods Mol Biol
Country: United States
NLM ID: 9214969
Publication information
Publication date:
2023
History:
entrez: 2022-10-13
pubmed: 2022-10-14
medline: 2022-10-18
Status:
ppublish
Abstract
Breast cancer is one of the most common cancers in women worldwide and causes an enormous number of deaths annually. Early diagnosis, however, can improve survival outcomes by enabling simpler and more cost-effective treatments. The recent increase in data availability provides unprecedented opportunities to apply data-driven and machine learning methods to identify early-detection prognostic factors capable of predicting patients' expected survival and potential sensitivity to treatment, with the final aim of enhancing clinical outcomes. This tutorial presents a protocol for applying machine learning models to survival analysis of both clinical and transcriptomic data. We show that integrating clinical and mRNA expression data is essential to explain the multiple biological processes driving cancer progression. Our results reveal that machine-learning-based models such as random survival forests, gradient-boosted survival models, and survival support vector machines can outperform traditional statistical methods, namely the Cox proportional hazards model. The highest C-index among the machine learning models was obtained with the survival support vector machine (0.688), whereas the Cox model reached a C-index of 0.677. Shapley Additive Explanations (SHAP) values were also used to assess feature importance in the models and the impact of each feature on the predictions.
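The models in the abstract are compared by their C-index (concordance index). As a rough illustration of what that number measures, the sketch below computes Harrell's C-index for right-censored data in plain Python. It is not the chapter's code: the function name `concordance_index`, the toy data, and the choice to skip pairs tied in time are all assumptions made for this example.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times       -- observed follow-up time per subject
    events      -- True if the event was observed, False if censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b]:
            continue  # assumption: pairs tied in time are skipped
        if not events[a]:
            continue  # earlier subject is censored, so the true order is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk was assigned to the earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data, not taken from the study.
toy_times = [5, 8, 12, 20, 25]
toy_events = [True, False, True, True, False]
toy_scores = [0.9, 0.4, 0.7, 0.2, 0.3]
toy_c_index = concordance_index(toy_times, toy_events, toy_scores)
print(round(toy_c_index, 3))
```

On this toy data, 6 of the 7 comparable pairs are ranked correctly, giving a C-index of about 0.857. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.688 and 0.677 in context.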
Identifiers
pubmed: 36227551
doi: 10.1007/978-1-0716-2617-7_16
Chemical substances
RNA, Messenger
0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
325-393
Copyright information
© 2023. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.