ntEdit: scalable genome sequence polishing.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
01 11 2019
History:
received: 05 12 2018
revised: 04 03 2019
accepted: 07 05 2019
pubmed: 17 5 2019
medline: 1 7 2020
entrez: 17 5 2019
Status: ppublish

Abstract

In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes.

We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 min, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and the whole genome. ntEdit scaled linearly, executing in 30-40 min on those sequences. We show how ntEdit ran in <2 h 20 min to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024.

ntEdit is available at https://github.com/bcgsc/ntedit. Supplementary data are available at Bioinformatics online.
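
To illustrate the general idea of Bloom filter-based editing mentioned in the abstract, here is a minimal Python sketch. It is not ntEdit's implementation: the class `BloomFilter`, the functions `kmers`, `build_filter` and `polish_substitutions`, the filter size, the hash scheme and the toy sequences are all assumptions made for illustration. The sketch stores read k-mers in a Bloom filter, scans a draft sequence, and, when a draft k-mer is absent from the filter, tries substituting its last base and keeps the first alternative whose k-mer is found. Indels and any further confirmation across overlapping k-mers are deliberately left out.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; membership tests can give false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_filter(reads, k):
    """Insert all read k-mers into a new Bloom filter."""
    bf = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(kmer)
    return bf


def polish_substitutions(draft, bf, k):
    """Fix single-base substitutions using k-mer presence in the filter.

    Simplification (assumption): only the last base of an unsupported
    k-mer is tried, and the first supported alternative is accepted.
    """
    seq = list(draft)
    edits = []
    for i in range(len(seq) - k + 1):
        kmer = "".join(seq[i:i + k])
        if kmer in bf:
            continue
        pos = i + k - 1
        for base in "ACGT":
            if base != seq[pos] and kmer[:-1] + base in bf:
                edits.append((pos, seq[pos], base))
                seq[pos] = base
                break
    return "".join(seq), edits


# Toy data (hypothetical): two error-free reads and a draft with one error.
truth = "ACGTTGCAAGCTTAGGCTAACGGATCCA"
reads = [truth[:20], truth[10:]]
draft = truth[:15] + "T" + truth[16:]

polished, edits = polish_substitutions(draft, build_filter(reads, k=8), k=8)
print(polished == truth, edits)
```

Because a Bloom filter never reports a stored k-mer as missing, an absent k-mer is a reliable hint that the draft disagrees with the reads; the cost is a small false positive rate, which is governed by the filter size and the number of hash functions.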

Identifiers

pubmed: 31095290
pii: 5490204
doi: 10.1093/bioinformatics/btz400
pmc: PMC6821332

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

4430-4432

Grants

Agency: NHGRI NIH HHS
ID: R01 HG007182
Country: United States

Copyright information

© The Author(s) 2019. Published by Oxford University Press.

Authors

René L Warren (RL)

Genome Sciences Centre, BC Cancer, Vancouver, Canada.

Lauren Coombe (L)

Genome Sciences Centre, BC Cancer, Vancouver, Canada.

Hamid Mohamadi (H)

Genome Sciences Centre, BC Cancer, Vancouver, Canada.

Jessica Zhang (J)

Genome Sciences Centre, BC Cancer, Vancouver, Canada.

Barry Jaquish (B)

BC Ministry of Forests, Lands, and Natural Resource Operations, Victoria, Canada.

Nathalie Isabel (N)

Laurentian Forestry Centre, Natural Resources Canada, Québec, Canada.

Steven J M Jones (SJM)

Genome Sciences Centre, BC Cancer, Vancouver, Canada.

Jean Bousquet (J)

Canada Research Chair in Forest Genomics, Université Laval, Québec, Canada.

Joerg Bohlmann (J)

Michael Smith Laboratories, University of British Columbia, Vancouver, Canada.

Inanç Birol (I)

Genome Sciences Centre, BC Cancer, Vancouver, Canada.
