Feature-specific quantile normalization and feature-specific mean-variance normalization deliver robust bi-directional classification and feature selection performance between microarray and RNAseq data.
Cross-platform normalization
FSMVN
FSQN
Feature selection
Mean
Microarray
Molecular classification
Quantile normalization
RNAseq
Variance
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
29 Mar 2024
History:
received: 06 Dec 2023
accepted: 20 Mar 2024
medline: 29 Mar 2024
pubmed: 29 Mar 2024
entrez: 29 Mar 2024
Status:
epublish
Abstract
Cross-platform normalization seeks to minimize technological bias between microarray and RNAseq whole-transcriptome data. Incorporating multiple gene expression platforms permits external validation of experimental findings and augments training sets for machine learning models. Here, we compare the performance of Feature Specific Quantile Normalization (FSQN) to a previously used but unvalidated and uncharacterized method that we label Feature Specific Mean Variance Normalization (FSMVN). We evaluate the performance of these methods for bidirectional normalization in the context of nested feature selection. FSQN and FSMVN provided clinically equivalent bidirectional model performance with and without feature selection for colon CMS and breast PAM50 classification. Using principal component analysis, we determine that these methods eliminate batch effects related to technological platforms. Without feature selection, no statistical difference was identified between the performance of FSQN and FSMVN on cross-platform data compared to within-platform distributions. Under optimal feature selection conditions, the balanced accuracy of FSQN and FSMVN was statistically equivalent to within-platform performance in multivariable linear regression analysis. FSQN and FSMVN also provided performance similar to within-platform distributions as the number of selected genes used to create models decreased. In the context of generating supervised machine learning classifiers for molecular subtypes, FSQN and FSMVN are equally effective. Under optimal modeling conditions, FSQN and FSMVN provide model accuracy on cross-platform normalized data equivalent to that on within-platform data. Cross-platform data should still be used with caution, as subtle performance differences may exist depending on the classification problem and on the training and testing distributions.
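The abstract names the two methods without giving their formulas. The following is a minimal sketch of what per-gene ("feature-specific") normalization of this kind could look like, not the authors' implementation. It assumes that both matrices are arranged as samples by genes with identical gene order, that FSMVN standardizes each gene in the data being normalized and rescales it to the target platform's mean and standard deviation for that gene, and that FSQN maps each gene's ranked values onto the quantiles of the same gene in the target platform. The function names `fsmvn` and `fsqn` and all parameter names are hypothetical.

```python
import numpy as np


def fsmvn(test, target):
    """Sketch of feature-specific mean-variance normalization (assumed form).

    test, target: arrays of shape (samples, genes) with the same gene order.
    Each gene in `test` is shifted and scaled to match the mean and
    standard deviation of that gene in `target`.
    """
    test = np.asarray(test, dtype=float)
    target = np.asarray(target, dtype=float)
    test_mean = test.mean(axis=0)
    test_sd = test.std(axis=0, ddof=1)
    target_mean = target.mean(axis=0)
    target_sd = target.std(axis=0, ddof=1)
    # Guard against division by zero for genes that are constant in `test`.
    test_sd = np.where(test_sd == 0, 1.0, test_sd)
    return (test - test_mean) / test_sd * target_sd + target_mean


def fsqn(test, target):
    """Sketch of feature-specific quantile normalization (assumed form).

    For each gene, the i-th smallest value in `test` is replaced by the
    corresponding quantile of that gene's distribution in `target`.
    Ties are broken by sample order, which is a simplification.
    """
    test = np.asarray(test, dtype=float)
    target = np.asarray(target, dtype=float)
    n_samples, n_genes = test.shape
    # Mid-point plotting positions, one per rank in `test`.
    probs = (np.arange(n_samples) + 0.5) / n_samples
    out = np.empty_like(test)
    for j in range(n_genes):
        order = np.argsort(test[:, j], kind="stable")
        ranks = np.argsort(order, kind="stable")
        target_quantiles = np.quantile(target[:, j], probs)
        out[:, j] = target_quantiles[ranks]
    return out
```

Under these assumptions, "bidirectional" normalization amounts to calling either function once with RNAseq as `test` and microarray as `target`, and once with the roles swapped. The mean-variance variant matches only the first two moments of each gene, whereas the quantile variant matches the whole per-gene distribution.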
Identifiers
pubmed: 38549046
doi: 10.1186/s12859-024-05759-w
pii: 10.1186/s12859-024-05759-w
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
136
Copyright information
© 2024. The Author(s).