Pan-genome de Bruijn graph using the bidirectional FM-index.

Approximate pattern matching Lossless alignment Pan-genome visualization Search schemes Sequence-to-graph alignment

Journal

BMC bioinformatics
ISSN: 1471-2105
Titre abrégé: BMC Bioinformatics
Pays: England
ID NLM: 100965194

Informations de publication

Date de publication:
26 Oct 2023
Historique:
received: 13 02 2023
accepted: 12 10 2023
medline: 30 10 2023
pubmed: 27 10 2023
entrez: 26 10 2023
Statut: epublish

Résumé

Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.

Sections du résumé

BACKGROUND BACKGROUND
Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance.
RESULTS RESULTS
We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment.
CONCLUSIONS CONCLUSIONS
We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.

Identifiants

pubmed: 37884897
doi: 10.1186/s12859-023-05531-6
pii: 10.1186/s12859-023-05531-6
pmc: PMC10605969
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

400

Subventions

Organisme : Fonds Wetenschappelijk Onderzoek
ID : 1117322N
Organisme : Fonds Wetenschappelijk Onderzoek
ID : 1SE7822N

Informations de copyright

© 2023. The Author(s).

Références

Bioinformatics. 2014 Dec 15;30(24):3476-83
pubmed: 25398610
Nat Genet. 2019 Feb;51(2):354-362
pubmed: 30643257
Nat Genet. 2014 Mar;46(3):279-86
pubmed: 24464101
PLoS Med. 2015 Sep 29;12(9):e1001880
pubmed: 26418737
Lancet. 1993 Mar 13;341(8846):647-50
pubmed: 8095569
Nat Methods. 2012 Mar 04;9(4):357-9
pubmed: 22388286
Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53
pubmed: 11504945
Nat Genet. 2012 Jan 08;44(2):226-32
pubmed: 22231483
Science. 2021 Dec 17;374(6574):abg8871
pubmed: 34914532
Appl Environ Microbiol. 2019 Oct 30;85(22):
pubmed: 31471308
Bioinformatics. 2021 Nov 18;37(22):4048-4055
pubmed: 34117875
Sci Transl Med. 2019 Apr 10;11(487):
pubmed: 30971455
Genome Biol. 2009;10(9):R98
pubmed: 19761611
Brief Bioinform. 2018 Jan 1;19(1):118-135
pubmed: 27769991
Genome Biol. 2020 Sep 24;21(1):253
pubmed: 32972461
Bioinformatics. 2009 Dec 15;25(24):3207-12
pubmed: 19808877
Nat Genet. 2011 Dec 18;44(1):106-10
pubmed: 22179134
Bioinformatics. 2020 Jun 1;36(12):3712-3718
pubmed: 32321164
Nat Biotechnol. 2019 Aug;37(8):907-915
pubmed: 31375807
BMC Bioinformatics. 2020 Jul 24;21(Suppl 12):306
pubmed: 32703258
Genome Biol. 2021 Jan 4;22(1):8
pubmed: 33397413
Genome Biol. 2020 Sep 17;21(1):250
pubmed: 32943086
BMC Bioinformatics. 2016 Jun 16;17(1):237
pubmed: 27306641
Proc Natl Acad Sci U S A. 2005 Sep 27;102(39):13950-5
pubmed: 16172379
Nat Biotechnol. 2018 Oct;36(9):875-879
pubmed: 30125266
Genome Res. 2003 Nov;13(11):2498-504
pubmed: 14597658
PLoS One. 2009 Nov 05;4(11):e7778
pubmed: 19890396
iScience. 2021 Jun 10;24(7):102687
pubmed: 34235407
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i260-i269
pubmed: 37387143
Bioinformatics. 2017 Oct 15;33(20):3181-3187
pubmed: 28200001
Genome Biol. 2020 Oct 16;21(1):265
pubmed: 33066802
Bioinformatics. 2016 Feb 15;32(4):497-504
pubmed: 26504144
Science. 2007 Nov 9;318(5852):901-2
pubmed: 17991835
Genome Biol. 2020 May 25;21(1):124
pubmed: 32450900
Mol Syst Biol. 2011 Aug 02;7:522
pubmed: 21811232
BMC Bioinformatics. 2018 Sep 4;19(1):311
pubmed: 30180801
G3 (Bethesda). 2015 Mar 17;5(5):931-41
pubmed: 25787242
Antimicrob Agents Chemother. 2013 Feb;57(2):827-32
pubmed: 23208709
Cell. 2017 May 18;169(5):849-861.e13
pubmed: 28502769
Bioinformatics. 2018 Jul 1;34(13):i169-i177
pubmed: 29949982
Bioinformatics. 2016 Nov 1;32(21):3224-3232
pubmed: 27378303
Algorithms Mol Biol. 2016 Jul 18;11:20
pubmed: 27437028
Science. 2000 Mar 24;287(5461):2196-204
pubmed: 10731133

Auteurs

Lore Depuydt (L)

Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium. Lore.Depuydt@UGent.be.

Luca Renders (L)

Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium.

Thomas Abeel (T)

Delft Bioinformatics Lab, Delft University of Technology, 2628 XE, Delft, The Netherlands.
Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.

Jan Fostier (J)

Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium. Jan.Fostier@UGent.be.

Articles similaires

Genome, Chloroplast Phylogeny Genetic Markers Base Composition High-Throughput Nucleotide Sequencing

Selecting optimal software code descriptors-The case of Java.

Yegor Bugayenko, Zamira Kholmatova, Artem Kruglov et al.
1.00
Software Algorithms Programming Languages

Exploring blood-brain barrier passage using atomic weighted vector and machine learning.

Yoan Martínez-López, Paulina Phoobane, Yanaima Jauriga et al.
1.00
Blood-Brain Barrier Machine Learning Humans Support Vector Machine Software

Classifications MeSH