Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers.
Journal
Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288
Publication information
Publication date:
16 10 2019
History:
received: 22 03 2019
accepted: 16 09 2019
entrez: 18 10 2019
pubmed: 18 10 2019
medline: 3 11 2020
Status:
epublish
Abstract
Remarkable advancements in high-throughput gene sequencing technologies have led to exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBGs). However, sequentially iterating from small to large k-values slows down assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building the DBGs for each distinct k-value in parallel. We develop an innovative mechanism to "patch" a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, both scaling up on all cores of a node and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes genome assembly faster than IDBA-UD, with similar accuracy, on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
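The idea of building one de Bruijn graph per k-value concurrently, rather than iterating over k sequentially, can be illustrated with a minimal sketch. This is not ScalaDBG's implementation: the function names `build_dbg` and `build_dbgs_for_all_k`, the toy reads, and the use of a thread pool are all assumptions made for illustration, and the "patching" step is not shown because the abstract does not describe its details.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_dbg(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def build_dbgs_for_all_k(reads, k_values):
    """Build one graph per k-value concurrently (hypothetical helper).

    A thread pool keeps the sketch short; it shows the independent-task
    structure, not real multi-core or multi-node speedup.
    """
    with ThreadPoolExecutor(max_workers=len(k_values)) as pool:
        graphs = pool.map(lambda k: build_dbg(reads, k), k_values)
        return dict(zip(k_values, graphs))


reads = ["ACGTAC", "CGTACG"]  # toy reads, invented for illustration
graphs = build_dbgs_for_all_k(reads, [3, 4])
```

Because each call to `build_dbg` depends only on the reads and its own k-value, the graphs have no ordering constraint between them, which is the property that allows the construction to be distributed across cores or nodes.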
Identifiers
pubmed: 31619717
doi: 10.1038/s41598-019-51284-9
pii: 10.1038/s41598-019-51284-9
pmc: PMC6795807
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
14882
Grants
Agency: U.S. Department of Health & Human Services | NIH | Center for Information Technology (Center for Information Technology, National Institutes of Health)
ID: 1R01AI123037
Country: International