NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data.


Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
2021-07-01
History:
received: 2020-12-21
revised: 2021-05-04
accepted: 2021-06-07
entrez: 2021-07-01
pubmed: 2021-07-02
medline: 2022-03-15
Status: ppublish

Abstract

Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant- and sample-specific biases. We introduce NPSV, a machine learning-based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a "black box" that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications.
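
The simulation-driven idea in the abstract can be illustrated with a toy sketch. This is not NPSV's implementation: the single feature (alternate-allele read fraction), the `ref_bias` and `error_rate` parameters, the nearest-centroid classifier, and every function name below are assumptions chosen only to show why training on simulated data for each genotype can absorb a bias that a fixed threshold would miss.

```python
import random
from statistics import mean

GENOTYPES = ("0/0", "0/1", "1/1")
# Ideal fraction of reads carrying the variant for each genotype.
ALT_FRACTION = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}


def simulate_alt_fraction(genotype, depth, ref_bias, error_rate, rng):
    """Simulate the observed alt-read fraction for one replicate.

    ref_bias and error_rate are hypothetical stand-ins for the combined
    region, sequencer, and aligner effects treated as a black box.
    """
    p = ALT_FRACTION[genotype] * (1.0 - ref_bias)      # alt reads lost to bias
    p = p * (1.0 - error_rate) + (1.0 - p) * error_rate  # miscounted reads
    alt_reads = sum(rng.random() < p for _ in range(depth))
    return alt_reads / depth


def train_centroids(depth, ref_bias, error_rate, replicates, rng):
    """Average the simulated feature over replicates for each genotype."""
    return {
        g: mean(
            simulate_alt_fraction(g, depth, ref_bias, error_rate, rng)
            for _ in range(replicates)
        )
        for g in GENOTYPES
    }


def call_genotype(observed_fraction, centroids):
    """Assign the genotype whose simulated centroid is closest."""
    return min(centroids, key=lambda g: abs(observed_fraction - centroids[g]))


if __name__ == "__main__":
    rng = random.Random(0)
    centroids = train_centroids(depth=30, ref_bias=0.3, error_rate=0.01,
                                replicates=200, rng=rng)
    for observed in (0.02, 0.35, 0.70):
        print(observed, call_genotype(observed, centroids))
```

With a strong reference bias, an observed alt fraction near 0.70 sits closest to the simulated homozygous-alternate centroid, whereas a genotyper that assumes ideal fractions of 0.5 and 1.0 would place it nearer the heterozygous expectation.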

Identifiers

pubmed: 34195837
pii: 6312216
doi: 10.1093/gigascience/giab046
pmc: PMC8246072

Publication types

Journal Article
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Grants

Agency: NIGMS NIH HHS
ID: P20 GM103449
Country: United States
Agency: NHLBI NIH HHS
ID: U01 HL153009
Country: United States
Agency: NHLBI NIH HHS
ID: UM1 HL098123
Country: United States

Copyright information

© The Author(s) 2021. Published by Oxford University Press GigaScience.

Authors

Michael D Linderman (MD)

Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA.

Crystal Paudyal (C)

Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA.

Musab Shakeel (M)

Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA.

William Kelley (W)

Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA.

Ali Bashir (A)

Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA.

Bruce D Gelb (BD)

Mindich Child Health and Development Institute and the Departments of Pediatrics and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave Levy Place, Box 1040, New York, NY 10029, USA.
