Correcting batch effects in large-scale multiomics studies using a reference-material-based ratio method.

Batch effect; Data integration; Differentially expressed; Metrology; Multiomics; Phenomics; Prediction; Quartet family; Ratio; Reference materials

Journal

Genome biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660

Publication information

Publication date:
7 Sep 2023
History:
received: 8 Nov 2022
accepted: 18 May 2023
medline: 8 Sep 2023
pubmed: 7 Sep 2023
entrez: 6 Sep 2023
Status: epublish

Abstract

Batch effects are notoriously common technical variations in multiomics data and may lead to misleading outcomes if left uncorrected or if over-corrected. A plethora of batch-effect correction algorithms have been proposed to facilitate data integration. However, their respective advantages and limitations have not been adequately assessed in terms of omics types, performance metrics, and application scenarios. As part of the Quartet Project for quality control and data integration of multiomics profiling, we comprehensively assess the performance of seven batch-effect correction algorithms using performance metrics of clinical relevance, i.e., the accuracy of identifying differentially expressed features, the robustness of predictive models, and the ability to accurately cluster cross-batch samples by donor. The ratio-based method, i.e., scaling absolute feature values of study samples relative to those of concurrently profiled reference material(s), is found to be much more effective and broadly applicable than the others, especially when batch effects are completely confounded with the biological factors of interest. We further provide practical guidelines for implementing the ratio-based approach in increasingly large-scale multiomics studies. Multiomics measurements are prone to batch effects, which can be effectively corrected using ratio-based scaling of the multiomics data. Our study lays the foundation for eliminating batch effects at a ratio scale.
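The ratio-based scaling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `ratio_scale`, the use of a log2 scale, the per-batch mean of reference replicates, and the small `eps` offset are all assumptions made for illustration; the abstract states only that study-sample values are scaled relative to concurrently profiled reference material(s).

```python
import numpy as np


def ratio_scale(values, batches, is_reference, eps=1e-9):
    """Express every sample as a log2 ratio to its own batch's reference profile.

    values:       (n_samples, n_features) array of non-negative absolute values
    batches:      length n_samples sequence of batch labels
    is_reference: length n_samples booleans, True for reference-material replicates
    eps:          hypothetical small offset that avoids log2(0)
    """
    values = np.asarray(values, dtype=float)
    batches = np.asarray(batches)
    is_reference = np.asarray(is_reference, dtype=bool)

    log_values = np.log2(values + eps)
    ratios = np.empty_like(log_values)

    for batch in np.unique(batches):
        in_batch = batches == batch
        ref_rows = in_batch & is_reference
        if not ref_rows.any():
            raise ValueError(f"batch {batch!r} has no reference replicates")
        # Mean log2 profile of the reference replicates profiled in this batch.
        ref_profile = log_values[ref_rows].mean(axis=0)
        # Subtracting in log space equals dividing on the original scale.
        ratios[in_batch] = log_values[in_batch] - ref_profile

    return ratios


# Toy data: batch B carries a 4x multiplicative shift relative to batch A.
values = np.array([
    [10.0, 20.0],  # batch A, reference material
    [20.0, 10.0],  # batch A, study sample
    [40.0, 80.0],  # batch B, reference material
    [80.0, 40.0],  # batch B, study sample
])
batches = ["A", "A", "B", "B"]
is_reference = [True, False, True, False]

corrected = ratio_scale(values, batches, is_reference)
print(corrected)
```

In this toy case the same study sample measured in both batches ends up with matching ratios, because the multiplicative batch shift affects the study sample and the reference alike and cancels in the ratio. The correction needs no knowledge of the biological groups, which is consistent with the abstract's point that the approach remains applicable when batch and biology are confounded.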

Identifiers

pubmed: 37674217
doi: 10.1186/s13059-023-03047-z
pii: 10.1186/s13059-023-03047-z
pmc: PMC10483871

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

201

Copyright information

© 2023. BioMed Central Ltd., part of Springer Nature.

Authors

Ying Yu (Y)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Naixin Zhang (N)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Yuanbang Mai (Y)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Luyao Ren (L)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Qiaochu Chen (Q)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Zehui Cao (Z)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Qingwang Chen (Q)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Yaqing Liu (Y)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Wanwan Hou (W)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.

Jingcheng Yang (J)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.
Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China.

Huixiao Hong (H)

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA.

Joshua Xu (J)

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA.

Weida Tong (W)

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA.

Lianhua Dong (L)

National Institute of Metrology, Beijing, China.

Leming Shi (L)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China. lemingshi@fudan.edu.cn.
International Human Phenome Institutes, Shanghai, China. lemingshi@fudan.edu.cn.

Xiang Fang (X)

National Institute of Metrology, Beijing, China. fangxiang@nim.ac.cn.

Yuanting Zheng (Y)

State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China. zhengyuanting@fudan.edu.cn.
