LazyB: fast and cheap genome assembly.

Keywords: Anchors; Genome assembly; Illumina sequencing; Nanopore sequencing; Spanning tree; Unitigs

Journal

Algorithms for molecular biology : AMB
ISSN: 1748-7188
Abbreviated title: Algorithms Mol Biol
Country: England
NLM ID: 101265088

Publication information

Publication date:
01 Jun 2021
History:
received: 31 Jan 2021
accepted: 06 May 2021
entrez: 02 Jun 2021
pubmed: 03 Jun 2021
medline: 03 Jun 2021
Status: epublish

Abstract

Advances in genome sequencing in recent years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, "hybrid" methods that integrate short- and long-read data have been devised to address this need. LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum-weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, written entirely in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes than state-of-the-art pipelines but also requires much less computational effort. LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. The LazyB prototype is available at https://github.com/TGatter/LazyB .
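
Two of the steps named in the abstract can be illustrated with a minimal sketch: projecting the bipartite read-unitig graph onto a long-read overlap graph, and extracting a maximum-weight path from a directed acyclic graph. This is not the LazyB implementation. The function names `project_overlaps` and `heaviest_path`, the use of shared-unitig counts as edge weights, and the toy read and unitig identifiers are all assumptions made for illustration. Edge direction here simply follows the sorted order of read identifiers, a stand-in for the orientation and cycle-removal steps, which the sketch does not implement.

```python
from collections import defaultdict
from itertools import combinations
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def project_overlaps(anchors):
    """Project a bipartite read-unitig graph onto the reads.

    anchors maps a long-read id to the set of unitig ids it contains.
    Returns {(a, b): weight}, where weight is the number of shared unitigs
    (an assumed weighting) and a < b in sorted order (an assumed direction).
    """
    reads_by_unitig = defaultdict(set)
    for read, unitigs in anchors.items():
        for unitig in unitigs:
            reads_by_unitig[unitig].add(read)
    weights = defaultdict(int)
    for reads in reads_by_unitig.values():
        for a, b in combinations(sorted(reads), 2):
            weights[(a, b)] += 1
    return dict(weights)


def heaviest_path(edges):
    """Return (total_weight, path) of a maximum-weight path in a DAG.

    edges maps (u, v) to a positive weight. Nodes are processed in
    topological order, so every predecessor is final before it is used.
    """
    preds = defaultdict(set)
    nodes = set()
    for u, v in edges:
        preds[v].add(u)
        nodes.update((u, v))
    if not nodes:
        return 0, []
    order = TopologicalSorter({n: preds[n] for n in nodes}).static_order()
    best = {}  # node -> (best weight of a path ending here, predecessor)
    for n in order:
        best[n] = (0, None)
        for p in preds[n]:
            candidate = best[p][0] + edges[(p, n)]
            if candidate > best[n][0]:
                best[n] = (candidate, p)
    end = max(best, key=lambda n: best[n][0])
    path, node = [], end
    while node is not None:  # walk predecessor links back to the start
        path.append(node)
        node = best[node][1]
    return best[end][0], path[::-1]


# Toy input: four hypothetical long reads anchored on five hypothetical unitigs.
toy_anchors = {
    "r1": {"u1", "u2"},
    "r2": {"u2", "u3", "u4"},
    "r3": {"u3", "u4", "u5"},
    "r4": {"u5"},
}
toy_edges = project_overlaps(toy_anchors)
toy_weight, toy_contig = heaviest_path(toy_edges)
```

In this toy setting the path visits the reads in the order in which they would be merged into a single contig; producing the actual sequence from such a path is the final step mentioned in the abstract and is not covered here.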

Identifiers

pubmed: 34074310
doi: 10.1186/s13015-021-00186-5
pii: 10.1186/s13015-021-00186-5
pmc: PMC8168326

Publication types

Journal Article

Languages

eng

Pagination

8

Grants

Organization: German Research Foundation (DFG)
ID: 850/19-2 within SPP 1738
Organization: Bundesministerium für Bildung und Forschung
ID: BMBF 031L0164C
Organization: Bundesministerium für Bildung und Forschung
ID: de.NBI-RBC
Organization: RSF / Helmholtz Association programme
ID: 18-44-06201
Organization: Deutscher Akademischer Austausch Dienst Kairo
ID: DAAD

Authors

Thomas Gatter (T)

Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany. thomas@bioinf.uni-leipzig.de.

Sarah von Löhneysen (S)

Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany.

Jörg Fallmann (J)

Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany.

Polina Drozdova (P)

Institute of Biology, Irkutsk State University, RU-664003, Irkutsk, Russia.

Tom Hartmann (T)

Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany.

Peter F Stadler (PF)

Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá, D.C, Colombia. studla@bioinf.uni-leipzig.de.
Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany. studla@bioinf.uni-leipzig.de.
Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103, Leipzig, Germany. studla@bioinf.uni-leipzig.de.
Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090, Vienna, Austria. studla@bioinf.uni-leipzig.de.
Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA. studla@bioinf.uni-leipzig.de.

MeSH classifications