Robustifying genomic classifiers to batch effects via ensemble learning.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
12 07 2021
History:
received: 15 02 2020
revised: 20 10 2020
accepted: 13 11 2020
pubmed: 28 11 2020
medline: 16 7 2021
entrez: 27 11 2020
Status: ppublish

Abstract

Genomic data are often produced in batches because of practical restrictions, which can introduce unwanted variation caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.

We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially when batch effects are severe. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.

The data underlying this article are available in the article and in its online supplementary material. Processed data are available in the GitHub repository with implementation code, at https://github.com/zhangyuqing/bea_ensemble.

Supplementary data are available at Bioinformatics online.
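The ensemble strategy described in the abstract (one prediction model per batch, combined through weights) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the nearest-centroid learner, the cross-batch accuracy weighting, the simulated data, and every function name here are hypothetical choices made for brevity.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy per-batch learner (an assumption): mean expression profile of each class."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict_score(model, X):
    """Score in [0, 1); values above 0.5 are closer to the class-1 centroid."""
    c0, c1 = model
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return d0 / (d0 + d1 + 1e-12)

def cross_batch_weights(models, batches):
    """Weight each model by its mean accuracy on the batches it was NOT trained on."""
    raw = []
    for i, model in enumerate(models):
        accs = [
            np.mean((predict_score(model, X) > 0.5).astype(int) == y)
            for j, (X, y) in enumerate(batches)
            if j != i
        ]
        raw.append(np.mean(accs))
    raw = np.asarray(raw)
    return raw / raw.sum()  # normalize so the weights sum to 1

def ensemble_predict(models, weights, X):
    """Weighted average of the per-batch model scores."""
    scores = np.stack([predict_score(m, X) for m in models])
    return weights @ scores

# Simulated data (hypothetical): 20 genes, 5 of which carry the disease signal,
# plus an additive mean shift per batch standing in for a batch effect.
rng = np.random.default_rng(0)
n_genes = 20
signal = np.zeros(n_genes)
signal[:5] = 2.0

def simulate_batch(n, shift):
    y = np.arange(n) % 2  # alternate labels so both classes are present
    X = rng.normal(size=(n, n_genes)) + np.outer(y, signal) + shift
    return X, y

batches = [simulate_batch(40, shift) for shift in (0.0, 0.5, -0.5)]
models = [fit_centroids(X, y) for X, y in batches]   # one model per batch
weights = cross_batch_weights(models, batches)       # ensemble weights

X_test, y_test = simulate_batch(30, 0.25)            # independent "study"
ensemble_scores = ensemble_predict(models, weights, X_test)
```

Thresholding `ensemble_scores` at 0.5 gives predicted disease status for the independent samples. In this sketch the batches are never merged or adjusted; each model sees only its own batch, and the batch effect enters only through how well a model transfers to the other batches.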

Identifiers

pubmed: 33245114
pii: 6007261
doi: 10.1093/bioinformatics/btaa986
pmc: PMC8485848

Publication types

Journal Article
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

1521-1527

Grants

Agency: NIGMS NIH HHS
ID: R01 GM127430
Country: United States
Agency: Division of Mathematical Sciences, National Science Foundation (NSF-DMS)
ID: 1810829
Agency: The National Cancer Institute
Agency: National Institutes of Health (NIH-NCI)
ID: 4P30CA006516-51

Copyright information

© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Authors

Yuqing Zhang (Y)

Clinical Bioinformatics, Gilead Sciences, Inc., Foster City, CA 94404, USA.

Prasad Patil (P)

Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.

W Evan Johnson (WE)

Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA.

Giovanni Parmigiani (G)

Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.
