LazyB: fast and cheap genome assembly.

Keywords: Anchors; Genome assembly; Illumina sequencing; Nanopore sequencing; Spanning tree; Unitigs

Journal

Algorithms for molecular biology : AMB

ISSN: 1748-7188

Abbreviated title: Algorithms Mol Biol

Country: England

NLM ID: 101265088

Publication information

Publication date:
01 Jun 2021

History:

received: 31 Jan 2021

accepted: 06 May 2021

entrez: 02 Jun 2021

pubmed: 03 Jun 2021

medline: 03 Jun 2021

Status: epublish

Abstract

BACKGROUND

Advances in genome sequencing in recent years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, "hybrid" methods that integrate short and long read data have been devised to address this need.

RESULTS

LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort.

CONCLUSIONS

LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements.

AVAILABILITY

The LazyB prototype is available at https://github.com/TGatter/LazyB.
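The abstract states that contigs are extracted as maximum weight paths once the overlap graph has been reduced to a directed acyclic graph. The following is a minimal sketch of that generic technique only (dynamic programming over a topological order). It is not taken from the LazyB code base and does not use the proper-interval-graph properties the abstract mentions. The function name `max_weight_path`, the read labels, and the edge weights are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict, deque


def max_weight_path(edges):
    """Return (total_weight, path) for a heaviest path in a weighted DAG.

    `edges` is an iterable of (source, target, weight) triples with
    non-negative weights. Illustrative helper, not part of LazyB.
    """
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: visit nodes in topological order.
    queue = deque(n for n in nodes if indegree[n] == 0)
    best = {n: 0 for n in nodes}     # heaviest path weight ending at n
    prev = {n: None for n in nodes}  # predecessor of n on that path
    processed = 0
    while queue:
        u = queue.popleft()
        processed += 1
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if processed != len(nodes):
        raise ValueError("graph contains a cycle")

    # Walk back from the best end point to recover the path.
    node = max(nodes, key=lambda n: best[n])
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return total, path[::-1]


# Toy overlap DAG: nodes are long reads, weights are made-up overlap scores.
toy_edges = [
    ("r1", "r2", 5),
    ("r2", "r3", 4),
    ("r1", "r3", 2),
    ("r3", "r4", 6),
    ("r2", "r5", 1),
]
weight, contig_reads = max_weight_path(toy_edges)
print(weight, contig_reads)
```

In this toy graph the heaviest path visits r1, r2, r3 and r4. A real assembler would then translate such a path into sequence, which the abstract describes as the final step.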

Identifiers

DOI: 10.1186/s13015-021-00186-5

PMID: 34074310

PMC: PMC8168326

PII: 10.1186/s13015-021-00186-5

Publication types

Journal Article

Languages

eng

Grants

Agency: German Research Foundation DFG

ID: 850/19-2 within SPP 1738

Agency: Bundesministerium für Bildung und Forschung

ID: BMBF 031L0164C

Agency: Bundesministerium für Bildung und Forschung

ID: de.NBI-RBC

Agency: RSF / Helmholtz Association programme

ID: 18-44-06201

Agency: Deutscher Akademischer Austausch Dienst Kairo

ID: DAAD

Authors

Thomas Gatter (T)

Sarah von Löhneysen (S)

Jörg Fallmann (J)

Polina Drozdova (P)

Tom Hartmann (T)

Peter F Stadler (PF)

MeSH classifications