EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data.


Journal

Nucleic acids research
ISSN: 1362-4962
Abbreviated title: Nucleic Acids Res
Country: England
NLM ID: 0411011

Publication information

Publication date:
23 04 2019
History:
accepted: 25 01 2019
revised: 17 12 2018
received: 04 10 2018
pubmed: 6 2 2019
medline: 29 10 2019
entrez: 6 2 2019
Status: ppublish

Abstract

The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at the raw data level; (b) assembles individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs) using a heuristic algorithm; (c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries using the local correlation structure in copy number intensities; and (e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to those of SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
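
Step (b) of the abstract, pooling calls from several callers into CNV regions, can be illustrated with a minimal sketch. This is not ensembleCNV's heuristic, which the abstract does not specify; it is a plain interval merge, assuming that any two calls overlapping on the same chromosome belong to one region. The `Call` type, the `merge_into_regions` function, and the caller names "callerA" and "callerB" are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Call(NamedTuple):
    """One CNV call from one upstream caller (hypothetical record layout)."""
    sample: str
    caller: str
    chrom: str
    start: int
    end: int


def merge_into_regions(calls):
    """Merge overlapping calls per chromosome into candidate regions.

    Assumption: simple overlap-based merging; a real pipeline would likely
    use a stricter criterion (for example, reciprocal overlap).
    """
    by_chrom = defaultdict(list)
    for call in calls:
        by_chrom[call.chrom].append(call)

    regions = []
    for chrom in sorted(by_chrom):
        current = None
        # Sweep calls left to right, extending the open region on overlap.
        for call in sorted(by_chrom[chrom], key=lambda c: (c.start, c.end)):
            if current is not None and call.start <= current["end"]:
                current["end"] = max(current["end"], call.end)
                current["callers"].add(call.caller)
                current["samples"].add(call.sample)
            else:
                current = {
                    "chrom": chrom,
                    "start": call.start,
                    "end": call.end,
                    "callers": {call.caller},
                    "samples": {call.sample},
                }
                regions.append(current)
    return regions


# Toy input: two callers reporting partly overlapping calls.
example_calls = [
    Call("s1", "callerA", "chr1", 100, 200),
    Call("s2", "callerB", "chr1", 150, 300),
    Call("s1", "callerA", "chr1", 500, 600),
    Call("s3", "callerB", "chr2", 100, 200),
]
regions = merge_into_regions(example_calls)
```

Each resulting region records which callers and samples support it, which is the kind of information a later re-genotyping step could draw on.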

Identifiers

pubmed: 30722045
pii: 5306576
doi: 10.1093/nar/gkz068
pmc: PMC6468244

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

e39

Copyright information

© The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.

Authors

Zhongyang Zhang (Z)

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Haoxiang Cheng (H)

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Xiumei Hong (X)

Center on the Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA.

Antonio F Di Narzo (AF)

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Oscar Franzen (O)

Integrated Cardio Metabolic Centre, Department of Medicine, Karolinska Institutet, Karolinska Universitetssjukhuset, Huddinge, Sweden.

Shouneng Peng (S)

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Arno Ruusalepp (A)

Department of Cardiac Surgery, Tartu University Hospital, Tartu, Estonia.

Jason C Kovacic (JC)

Cardiovascular Research Center, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

Johan L M Bjorkegren (JLM)

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Integrated Cardio Metabolic Centre, Department of Medicine, Karolinska Institutet, Karolinska Universitetssjukhuset, Huddinge, Sweden.

Xiaobin Wang (X)

Center on the Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA.
Division of General Pediatrics & Adolescent Medicine, Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.

Ke Hao (K)

Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
The Tenth People's Hospital, Tongji University, Shanghai 200072, China.
College of Environmental Science and Engineering, Tongji University, Shanghai 200092, China.
