Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers.


Journal

Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
16 10 2019
History:
received: 22 03 2019
accepted: 16 09 2019
entrez: 18 10 2019
pubmed: 18 10 2019
medline: 3 11 2020
Status: epublish

Abstract

Remarkable advances in high-throughput gene sequencing technologies have led to exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBGs). However, iterating sequentially from small to large k-values slows down assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building the DBG for each distinct k-value in parallel. We develop an innovative mechanism to "patch" a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism by both scaling up on all cores of a node and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG assembles genomes faster than IDBA-UD with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
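
The idea of building one graph per k-value concurrently and then patching the higher-k graph with lower-k contigs can be illustrated with a toy Python sketch. This is not ScalaDBG's implementation: the function names (`build_dbg`, `simple_contigs`, `patch`), the dictionary-based graph, and the two example reads are all assumptions made for illustration, and a thread pool merely stands in for the multi-core and multi-node parallelism the abstract describes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_dbg(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def simple_contigs(graph):
    """Spell maximal non-branching paths. Isolated cycles are ignored (toy simplification)."""
    indeg = defaultdict(int)
    for dsts in graph.values():
        for dst in dsts:
            indeg[dst] += 1

    def is_internal(node):
        # Exactly one way in and one way out: the path continues through this node.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for src in list(graph):
        if is_internal(src):
            continue
        for dst in graph[src]:
            path = src + dst[-1]
            while is_internal(dst):
                dst = next(iter(graph[dst]))
                path += dst[-1]
            contigs.append(path)
    return contigs


def patch(high_graph, low_contigs, k_high):
    """Hypothetical patch step: add the k_high-mers of lower-k contigs to the higher-k graph."""
    patched = defaultdict(set, {n: set(s) for n, s in high_graph.items()})
    for node, succs in build_dbg(low_contigs, k_high).items():
        patched[node] |= succs
    return patched


# Two made-up reads that overlap by only 4 bases, so the k=6 graph has a gap.
reads = ["ACGTTGCA", "TGCATGGC"]
k_values = [4, 6]

# Build the graph for every k-value concurrently instead of one after another.
with ThreadPoolExecutor() as pool:
    graphs = dict(zip(k_values, pool.map(lambda k: build_dbg(reads, k), k_values)))

contigs_low = simple_contigs(graphs[4])
patched = patch(graphs[6], contigs_low, 6)
```

In this toy case the k=6 graph alone yields two fragments because no read contains the 6-mer that spans the overlap, whereas the k=4 contig does contain it, so patching joins the fragments into a single path.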

Identifiers

pubmed: 31619717
doi: 10.1038/s41598-019-51284-9
pii: 10.1038/s41598-019-51284-9
pmc: PMC6795807

Publication types

Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

14882

Grants

Agency: U.S. Department of Health & Human Services | NIH | Center for Information Technology (Center for Information Technology, National Institutes of Health)
ID: 1R01AI123037
Country: International

Authors

Kanak Mahadik (K)

Adobe Research, San Jose, USA. mahadik@adobe.com.

Christopher Wright (C)

Purdue University, West Lafayette, IN, USA.

Milind Kulkarni (M)

Purdue University, West Lafayette, IN, USA.

Saurabh Bagchi (S)

Purdue University, West Lafayette, IN, USA.

Somali Chaterji (S)

Purdue University, West Lafayette, IN, USA. schaterji@schaterji.io.
