A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
08 May 2024
History:
received: 31 Jan 2024
accepted: 02 May 2024
medline: 09 May 2024
pubmed: 09 May 2024
entrez: 09 May 2024
Status: epublish

Abstract

RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated by different labs. Often, the data vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. We aimed to investigate, through trial and comparison, the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling. Our goal was to improve cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the classifier models constructed for tissue of origin predictions in cancer. Using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, applying data preprocessing techniques in a machine learning pipeline is not always appropriate.
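
To make the evaluation setup in the abstract concrete, here is a minimal Python sketch of a cross-study test scored with a weighted F1 (per-class F1 averaged with weights proportional to each class's number of true samples). It is an illustration only and rests on several assumptions: all function names and the toy expression values are hypothetical, the nearest-centroid classifier is a simple stand-in for whatever classifier the study used, and per-study mean-centering is a crude location-only adjustment standing in for a real batch effect correction method.

```python
from collections import Counter
from math import dist


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1
    return total / len(y_true)


def center_per_study(samples):
    """Subtract each gene's within-study mean (assumed stand-in for batch correction)."""
    n = len(samples)
    means = [sum(col) / n for col in zip(*samples)]
    return [[x - m for x, m in zip(row, means)] for row in samples]


def fit_centroids(samples, labels):
    """Compute the mean expression profile of each class."""
    groups = {}
    for row, label in zip(samples, labels):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(centroids, samples):
    """Assign each sample to the class whose centroid is nearest (Euclidean)."""
    return [min(centroids, key=lambda c: dist(centroids[c], row)) for row in samples]


# Hypothetical toy data: rows are samples, columns are two genes.
# The test study carries an additive shift of +10 on the first gene.
train_x = [[1.0, 5.0], [2.0, 6.0], [5.0, 1.0], [6.0, 2.0]]
train_y = ["lung", "lung", "liver", "liver"]
test_x = [[11.0, 5.0], [12.0, 6.0], [15.0, 1.0], [16.0, 2.0]]
test_y = ["lung", "lung", "liver", "liver"]

# Train on one study, test on the other, with and without the adjustment.
raw_pred = predict(fit_centroids(train_x, train_y), test_x)
adj_pred = predict(fit_centroids(center_per_study(train_x), train_y),
                   center_per_study(test_x))

f1_raw = weighted_f1(test_y, raw_pred)
f1_adjusted = weighted_f1(test_y, adj_pred)
print(f"weighted F1 without adjustment: {f1_raw:.3f}")
print(f"weighted F1 with per-study centering: {f1_adjusted:.3f}")
```

On this toy data the shift pushes every unadjusted test sample toward the "liver" centroid, giving a weighted F1 of about 0.333, while centering each study restores a score of 1.000. Note that centering a study on its own mean implicitly assumes a similar class mix in both studies; when that assumption fails, such an adjustment can remove real biological signal, which is one general reason preprocessing does not help in every setting.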

Identifiers

pubmed: 38720247
doi: 10.1186/s12859-024-05801-x
pii: 10.1186/s12859-024-05801-x

Publication types

Journal Article; Comparative Study

Languages

eng

Citation subsets

IM

Pagination

181

Grants

Agency: NIGMS of the National Institutes of Health
ID: 5P20GM121325

Copyright information

© 2024. The Author(s).

Authors

Richard Van (R)

School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA.
Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

Daniel Alvarez (D)

Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

Travis Mize (T)

Icahn School of Medicine at Mount Sinai, Institute for Genomic Health, New York City, NY, USA.

Sravani Gannavarapu (S)

Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

Lohitha Chintham Reddy (L)

Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

Fatma Nasoz (F)

Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.

Mira V Han (MV)

School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA. mira.han@unlv.edu.
Nevada Institute of Personalized Medicine, Las Vegas, NV, USA. mira.han@unlv.edu.
