ntEdit: scalable genome sequence polishing.

MeSH terms: Animals; Genome, Human; Genomics; Haploidy; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA; Software

Journal

Bioinformatics (Oxford, England)

ISSN: 1367-4811

Abbreviated title: Bioinformatics

Country: England

NLM ID: 9808944

Publication information

Publication date:
01 11 2019

History:

received: 05 12 2018

revised: 04 03 2019

accepted: 07 05 2019

pubmed: 17 5 2019

medline: 1 7 2020

entrez: 17 5 2019

Status: ppublish

Abstract

In the modern genomics era, genome sequence assembly is routine practice. However, depending on the methodology, the resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 min, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage when editing chromosomes 1 and 21 and the whole genome. ntEdit scaled linearly, executing in 30-40 min on those sequences. We show how ntEdit ran in <2 h 20 min to improve upon long- and linked-read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frameshifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte) and used it to edit our pseudo-haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. ntEdit is available at https://github.com/bcgsc/ntedit. Supplementary data are available at Bioinformatics online.
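
The abstract names a Bloom filter as the data structure behind ntEdit but does not describe the algorithm. As a rough illustration only, the sketch below shows the general idea of loading read k-mers into a Bloom filter and flagging draft positions whose k-mers are absent from it. It is not ntEdit's implementation: the class `BloomFilter`, the helpers `kmers` and `unsupported_positions`, the filter size, the number of hash functions and the SHA-256-based hashing are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (hypothetical, for illustration only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by prefixing a seed to the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unsupported_positions(draft, reads, k):
    """Start positions of draft k-mers that are absent from the read k-mers."""
    bf = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(kmer)
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bf]


if __name__ == "__main__":
    reads = ["ACGTACGTGA"]
    draft = "ACGTTCGTGA"  # one substitution relative to the read
    print(unsupported_positions(draft, reads, k=4))
```

A run of consecutive unsupported k-mers, as produced around the substituted base here, marks a candidate site for correction. Because a Bloom filter can report false positives, an erroneous k-mer may occasionally appear supported, so the filter size and hash count trade memory against accuracy.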

Identifiers

DOI: 10.1093/bioinformatics/btz400 PMID: 31095290 PMC: PMC6821332

pii: 5490204

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

4430-4432

Grants

Agency: NHGRI NIH HHS

ID: R01 HG007182

Country: United States


Authors

René L Warren (RL)

Lauren Coombe (L)

Hamid Mohamadi (H)

Jessica Zhang (J)

Barry Jaquish (B)

Nathalie Isabel (N)

Steven J M Jones (SJM)

Jean Bousquet (J)

Joerg Bohlmann (J)

Inanç Birol (I)
