Feature-specific quantile normalization and feature-specific mean-variance normalization deliver robust bi-directional classification and feature selection performance between microarray and RNAseq data.

Keywords: Cross-platform normalization; FSMVN; FSQN; Feature selection; Mean; Microarray; Molecular classification; Quantile normalization; RNAseq; Variance

Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
29 Mar 2024
History:
received: 06 Dec 2023
accepted: 20 Mar 2024
medline: 29 Mar 2024
pubmed: 29 Mar 2024
entrez: 29 Mar 2024
Status: epublish

Abstract

Cross-platform normalization seeks to minimize technological bias between microarray and RNAseq whole-transcriptome data. Incorporating multiple gene expression platforms permits external validation of experimental findings and augments training sets for machine learning models. Here, we compare the performance of Feature Specific Quantile Normalization (FSQN) to a previously used but unvalidated and uncharacterized method that we label Feature Specific Mean Variance Normalization (FSMVN). We evaluate the performance of these methods for bidirectional normalization in the context of nested feature selection. FSQN and FSMVN provided clinically equivalent bidirectional model performance with and without feature selection for colon CMS and breast PAM50 classification. Using principal component analysis, we determine that these methods eliminate batch effects related to technological platforms. Without feature selection, no statistical difference was identified between the performance of FSQN and FSMVN on cross-platform data and performance on within-platform distributions. Under optimal feature selection conditions, the balanced accuracy of FSQN and FSMVN was statistically equivalent to within-platform distribution performance in multivariable linear regression analysis. FSQN and FSMVN also provided similar performance to within-platform distributions as the number of selected genes used to create models decreased. In the context of generating supervised machine learning classifiers for molecular subtypes, FSQN and FSMVN are equally effective. Under optimal modeling conditions, FSQN and FSMVN provide equivalent model accuracy on cross-platform normalized data compared to within-platform data. Cross-platform data should still be used with caution, as subtle performance differences may exist depending on the classification problem and on the training and testing distributions.
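The abstract names the two normalization methods without spelling out their arithmetic. The following is a minimal sketch only, assuming that FSMVN rescales each gene in the data being normalized to the mean and variance of the same gene on the target platform, and that FSQN maps each gene's sample ranks onto the quantiles of the same gene on the target platform. The function names `fsmvn` and `fsqn`, the samples-by-genes array layout, and the midpoint plotting positions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fsmvn(source, target, eps=1e-12):
    """Feature-specific mean-variance sketch (assumed definition).

    Each column (gene) of `source` is standardized and then rescaled to the
    mean and standard deviation of the same column in `target`.
    Both arrays are samples x genes with matching gene order.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    src_mean = source.mean(axis=0)
    src_std = source.std(axis=0, ddof=1)
    tgt_mean = target.mean(axis=0)
    tgt_std = target.std(axis=0, ddof=1)
    # Constant genes have zero spread; divide by 1 so they map to tgt_mean.
    safe_std = np.where(src_std > eps, src_std, 1.0)
    return (source - src_mean) / safe_std * tgt_std + tgt_mean


def fsqn(source, target):
    """Feature-specific quantile sketch (assumed definition).

    For each gene, the k-th smallest source value is replaced by the
    corresponding empirical quantile of that gene in `target`.
    Ties in `source` are broken by sample order.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples, n_genes = source.shape
    # Midpoint plotting positions for the n order statistics (an assumption).
    probs = (np.arange(n_samples) + 0.5) / n_samples
    out = np.empty_like(source)
    for j in range(n_genes):
        order = np.argsort(source[:, j], kind="stable")
        ranks = np.argsort(order, kind="stable")  # rank of each sample, 0-based
        target_quantiles = np.quantile(target[:, j], probs)
        out[:, j] = target_quantiles[ranks]
    return out
```

Under these assumptions, `fsmvn` matches only the first two moments of each gene, whereas `fsqn` matches the whole per-gene distribution while preserving the within-gene sample ordering. In a bidirectional setting, either platform can play the role of `target`.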

Identifiers

pubmed: 38549046
doi: 10.1186/s12859-024-05759-w
pii: 10.1186/s12859-024-05759-w

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

136

Copyright information

© 2024. The Author(s).

Authors

Daniel Skubleny (D)

Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada. skubleny@ualberta.ca.

Sunita Ghosh (S)

Department of Oncology, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.
Department of Mathematical and Statistical Sciences, Faculty of Science, University of Alberta, Edmonton, AB, T6G 2R3, Canada.

Jennifer Spratlin (J)

Department of Oncology, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.

Daniel E Schiller (DE)

Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.

Gina R Rayat (GR)

Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada.

MeSH classifications