A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.
Batch effect correction
Cancer
Classification
Data scaling
Normalization
RNA-Seq
Journal
BMC bioinformatics
ISSN: 1471-2105
Titre abrégé: BMC Bioinformatics
Pays: England
ID NLM: 100965194
Informations de publication
Date de publication:
08 May 2024
08 May 2024
Historique:
received:
31
01
2024
accepted:
02
05
2024
medline:
9
5
2024
pubmed:
9
5
2024
entrez:
9
5
2024
Statut:
epublish
Résumé
RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the quality of the data is variable, procured differently, and contains unwanted noise hampering the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origins. We aimed to investigate the impact of data preprocessing steps-focusing on normalization, batch effect correction, and data scaling-through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
Sections du résumé
BACKGROUND
BACKGROUND
RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the quality of the data is variable, procured differently, and contains unwanted noise hampering the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origins.
RESULTS
RESULTS
We aimed to investigate the impact of data preprocessing steps-focusing on normalization, batch effect correction, and data scaling-through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.
CONCLUSION
CONCLUSIONS
By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
Identifiants
pubmed: 38720247
doi: 10.1186/s12859-024-05801-x
pii: 10.1186/s12859-024-05801-x
doi:
Types de publication
Journal Article
Comparative Study
Langues
eng
Sous-ensembles de citation
IM
Pagination
181Subventions
Organisme : NIGMS of the National Institutes of Health
ID : 5P20GM121325
Organisme : NIGMS of the National Institutes of Health
ID : 5P20GM121325
Organisme : NIGMS of the National Institutes of Health
ID : 5P20GM121325
Organisme : NIGMS of the National Institutes of Health
ID : 5P20GM121325
Organisme : NIGMS of the National Institutes of Health
ID : 5P20GM121325
Organisme : NIGMS of the National Institutes of Health
ID : 5P20GM121325
Organisme : NIGMS of the National Institutes of Health
ID : 5P20GM121325
Informations de copyright
© 2024. The Author(s).
Références
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7.
pubmed: 10521349
doi: 10.1126/science.286.5439.531
Keyes TJ, Domizi P, Lo Y, Nolan GP, Davis KL. A Cancer biologist’s primer on machine learning applications in high-dimensional cytometry. Cytometry Pt A. 2020;97(8):782–99.
doi: 10.1002/cyto.a.24158
Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol. 2015;19(1A):A68–77.
Carithers LJ, Ardlie K, Barcus M, Branton PA, Britton A, Buia SA, et al. A novel approach to high-quality postmortem tissue procurement: the GTEx project. Biopreserv Biobank. 2015;13(5):311–9.
pubmed: 26484571
pmcid: 4675181
doi: 10.1089/bio.2015.0032
Zhang J, Baran J, Cros A, Guberman JM, Haider S, Hsu J, et al. International Cancer Genome Consortium Data Portal—a one-stop shop for cancer genomics data. Database J Biol Databases Curation. 2011;2011:bar026.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets-update. Nucleic Acids Res. 2013;41:D991–5.
pubmed: 23193258
doi: 10.1093/nar/gks1193
Liñares-Blanco J, Pazos A, Fernandez-Lozano C. Machine learning analysis of TCGA cancer data. PeerJ Comput Sci. 2021;7:e584.
pubmed: 34322589
pmcid: 8293929
doi: 10.7717/peerj-cs.584
Dillies M, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2012;14(6):671–83.
pubmed: 22988256
doi: 10.1093/bib/bbs046
Leek JT, Scharpf RB, Barvo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11:733–9.
pubmed: 20838408
doi: 10.1038/nrg2825
Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3:161.
doi: 10.1371/journal.pgen.0030161
Ten CD. Ten quick tips for machine learning in computational biology. BioData Min. 2017;10(1):35.
doi: 10.1186/s13040-017-0155-3
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–13.
Whalen S, Schreiber J, Noble WS, Pollard KS. Navigating the pitfalls of applying machine learning in genomics. Nature. 2022;23:169–81.
Alkhateeb A, Rueda L. Zseq: an approach for preprocessing next-generation sequencing data. J Comput Biol. 2017;24(8):746–55.
pubmed: 28414515
pmcid: 5563921
doi: 10.1089/cmb.2017.0021
Zhang Y, Yamaguchi R, Imoto S, Miyano S. Sequence-specific bias correction for RNA-seq data using recurrent neural networks. BMC Genomics. 2016;18(S1):1–6.
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, et al. Reproducible RNA-seq analysis using recount2. Nat Biotechnol. 2017;35:319–21.
pubmed: 28398307
pmcid: 6742427
doi: 10.1038/nbt.3838
Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, Davison T, et al. A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J. 2010;10(4):278–91.
pubmed: 20676067
pmcid: 2920074
doi: 10.1038/tpj.2010.57
Hornung R, Boulesteix A, Causeur D. Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment. BMC Bioinform. 2016;17(1):1–19.
doi: 10.1186/s12859-015-0870-z
Hornung R, Causeur D, Bernau C, Boulesteix A. Improving cross-study prediction through addon batch effect adjustment or addon normalization. Bioinformatics. 2017;33(3):397–404.
pubmed: 27797760
doi: 10.1093/bioinformatics/btw650
Ellis SE, Collado-Torres L, Jaffe A, Leek JT. Improving the value of public RNA-seq expression data by phenotype prediction. Nucleic Acids Res. 2018;46:e54–e54.
pubmed: 29514223
pmcid: 5961118
doi: 10.1093/nar/gky102
Leek JT, Evan Johnson W, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28:882–3.
pubmed: 22257669
pmcid: 3307112
doi: 10.1093/bioinformatics/bts034
Zhang Y, Jenkins DF, Manimaran S, Johnson WE. Alternative empirical Bayes models for adjusting for batch effects in genomic studies. BMC Bioinform. 2018;19(1):1–15.
doi: 10.1186/s12859-018-2263-6
Rule A, Birmingham A, Zuniga C, Altintas I, Huang S, Knight R, et al. Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks. PLoS Comput Biol. 2019;15(7):e1007007.
pubmed: 31344036
pmcid: 6657818
doi: 10.1371/journal.pcbi.1007007
Nowicki-Osuch K, Zhuang L, Cheung TS, Black EL, Masqué-Soler N, Devonshire G, et al. Single-cell RNA sequencing unifies developmental programs of esophageal and gastric intestinal metaplasia. Cancer Discov. 2023;13:1346–63.
pubmed: 36929873
pmcid: 10236154
doi: 10.1158/2159-8290.CD-22-0824
Liu Y, Liu J, Getz G, Lawrence MS, Saksena G, Voet D, et al. Comparative molecular analysis of gastrointestinal adenocarcinomas. Cancer Cell. 2018;33(4):721-735.e8.
pubmed: 29622466
pmcid: 5966039
doi: 10.1016/j.ccell.2018.03.010
Peran I, Madhavan S, Byers SW, Mccoy MD. Curation of the pancreatic ductal adenocarcinoma subset of the cancer genome atlas is essential for accurate conclusions about survival-related molecular mechanisms. Clin Cancer Res. 2018;24(16):3813–9.
pubmed: 29739787
doi: 10.1158/1078-0432.CCR-18-0290
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2012;29(1):15–21.
pubmed: 23104886
pmcid: 3530905
doi: 10.1093/bioinformatics/bts635
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907–15.
pubmed: 31375807
pmcid: 7605509
doi: 10.1038/s41587-019-0201-4
Joseph VR. Optimal ratio for data splitting. Stat Anal Data Min. 2022;15(4):531–8.
doi: 10.1002/sam.11583
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26(4):493–500.
pubmed: 20022975
doi: 10.1093/bioinformatics/btp692
Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131(4):281–5.
pubmed: 22872506
doi: 10.1007/s12064-012-0162-3
Bolstad B. preprocessCore: a collection of pre-processing functions. 2023. https://bioconductor.org/packages/release/bioc/html/preprocessCore.html .
Franks JM, Cai G, Whitfield ML. Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data. Bioinformatics. 2018;34(11):1868–74.
pubmed: 29360996
pmcid: 5972664
doi: 10.1093/bioinformatics/bty026
Ramos M, Schiffer L, Waldron L. TCGAutils: TCGA utility functions for data management. 2023. https://www.bioconductor.org/packages/release/bioc/html/TCGAutils.html .
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
pubmed: 25605792
pmcid: 4402510
doi: 10.1093/nar/gkv007
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.
Hsu C, Chang C, Lin C. A Practical Guide to Support Vector Classification. 2003.
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
doi: 10.1007/BF00994018
Chang C, Lin C. LIBSVM: a library for support vector machines. 2011.
Giuliani A. The application of principal component analysis to drug discovery and biomedical data. Drug Deliv Today. 2017;22(7):1069–76.
doi: 10.1016/j.drudis.2017.01.005
Van Der Maaten L, Hinton G. Visualizing data using t-SNE. 2008.
McInnes L, Healy J, Saul N, Großberger L. UMAP: uniform manifold approximation and projection. 2018.
Tsamardinos I, Rakhshani A, Lagani V. Performance-estimation properties of cross-validation-based protocols with simultaneous hyper-parameter optimization. 2015.
Radzi SFM, Karim MKA, Saripan MI, Rahman MAA, Isa INC, Ibahim MJ. Hyperparameter tuning and pipeline optimization via grid search method and tree-based autoML in breast cancer prediction. 2021.
Cherkassky V, Ma Y. Practical selection of SVM parameters and noise estimation for SVM regression. 2003.
Behera B, Kumaravelan G, Kumar BP. Performance evaluation of deep learning algorithms in biomedical document classification. ICoAC 2019.
Lundberg SM, Allen PG. A unified approach to interpreting model predictions. 2017.
Jones S, Beyers M, Shukla M, Xia F, Brettin T, Stevens R, et al. TULIP: an RNA-seq-based primary tumor type prediction tool using convolutional neural networks. Cancer Inform. 2022;21:11769351221139492.
pubmed: 36507076
pmcid: 9729992
doi: 10.1177/11769351221139491
Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). 1965.
Gastwirth JL, Gel YR, Miao W. The impact of Levene's test of equality of variances on statistical theory and practice. 2009.
Hunter JD. Matplotlib: A 2D Graphics Environment. 2007.
Nakano R. Scikit-plot. 2018. https://github.com/reiinakano/scikit-plot .
Wickham H. ggplot2: elegant graphics for data analysis. 2nd ed. Berlin: Springer; 2016.
doi: 10.1007/978-3-319-24277-4
FC M, Davis TL. ggpattern: 'ggplot2' pattern geoms. 2022. https://github.com/trevorld/ggpattern .
Ntzani EE, Ioannidis JPA. Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment. 2003.
Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Prood Natl Acad Sci. 2001;98(26):15149–54.
doi: 10.1073/pnas.211566398
Wei IH, Shi Y, Jiang H, Kumar-Sinha C, Chinnaiyan AM. rna-seq accurately identifies cancer biomarker signatures to distinguish tissue of origin. Neoplasia. 2014;16(11):918–27.
pubmed: 25425966
pmcid: 4240918
doi: 10.1016/j.neo.2014.09.007
Li Y, Kang K, Krahn JM, Croutwater N, Lee K, Umbach DM, et al. A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genomics. 2017;18(1):508.
pubmed: 28673244
pmcid: 5496318
doi: 10.1186/s12864-017-3906-0
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologist. 2022.
Moran S, Martínez-Cardús A, Sayols S, Musulén E, Balañá C, Estival-Gonzalez A, et al. Epigenetic profiling to classify cancer of unknown primary: a multicentre, retrospective analysis. Lancet Oncol. 2016;17(10):1386–95.
pubmed: 27575023
doi: 10.1016/S1470-2045(16)30297-2
Xu Q, Chen J, Ni S, Tan C, Xu M, Dong L, et al. Pan-cancer transcriptome analysis reveals a gene expression signature for the identification of tumor tissue origin. Mod Pathol. 2016;29(6):546–56.
pubmed: 26990976
doi: 10.1038/modpathol.2016.60
Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, et al. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets—improving meta-analysis and prediction of prognosis. 2008;1(1):42.
Adlung L, Cohen Y, Mor U, Elinav E. Machine learning in clinical decision making. Med. 2021;2(6):642–65.
pubmed: 35590138
doi: 10.1016/j.medj.2021.04.006
Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2006;8(1):118–27.
pubmed: 16632515
doi: 10.1093/biostatistics/kxj037
Wolpert DH, Macready WG. No free lunch theorems for optimization. TEVC. 1997;1(1):67–82.
Nygaard V, Rødland EA, Hovig E. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics. 2016;17(1):29–39.
pubmed: 26272994
doi: 10.1093/biostatistics/kxv027
Luijken K, Groenwold RHH, Van Calster B, Steyerberg EW, Van Smeden M. Impact of predictor measurement heterogeneity across settings on the performance of prediction models: a measurement error perspective. Stat Med. 2019;38(18):3444–59.
pubmed: 31148207
pmcid: 6619392
doi: 10.1002/sim.8183
Cao XH, Stojkovic I, Obradovic Z. A robust data scaling algorithm to improve classification accuracies in biomedical data. BMC Bioinform. 2016;17(1):359.
doi: 10.1186/s12859-016-1236-x
Stegle O, Parts L, Piipari M, Winn J, Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat Protoc. 2012;7(3):500–7.
pubmed: 22343431
pmcid: 3398141
doi: 10.1038/nprot.2011.457
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32(9):896–902.
pubmed: 25150836
pmcid: 4404308
doi: 10.1038/nbt.2931
Gretton A, Smola A, Huang J, Schmittfull M, Borgwardt K, Schölkopf B. Covariate shift by kernel mean matching. In: Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND, editors. Dataset shift in machine learning. MIT Press: Cambridge; 2008. p. 131–60.
doi: 10.7551/mitpress/7921.003.0013
Sugiyama M, Suzuki T, Nakajima S, Kashima H, Von Bünau P, Kawanabe M. Direct importance estimation for covariate shift adaptation. Ann Inst Stat Math. 2008;60(4):699–746.
doi: 10.1007/s10463-008-0197-x
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Commun. ACM 2020;63(11).
Dincer AB, Janizek JD, Lee S. Adversarial deconfounding autoencoder for learning robust gene expression embeddings. Bioinformatics. 2020;36:i573–82.
pubmed: 33381842
pmcid: 7773484
doi: 10.1093/bioinformatics/btaa796
Upadhyay U, Jain A. Removal of batch effects using generative adversarial networks. 2019.