Pan-genome de Bruijn graph using the bidirectional FM-index.
Approximate pattern matching
Lossless alignment
Pan-genome visualization
Search schemes
Sequence-to-graph alignment
Journal
BMC bioinformatics
ISSN: 1471-2105
Titre abrégé: BMC Bioinformatics
Pays: England
ID NLM: 100965194
Informations de publication
Date de publication:
26 Oct 2023
26 Oct 2023
Historique:
received:
13
02
2023
accepted:
12
10
2023
medline:
30
10
2023
pubmed:
27
10
2023
entrez:
26
10
2023
Statut:
epublish
Résumé
Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
Sections du résumé
BACKGROUND
BACKGROUND
Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance.
RESULTS
RESULTS
We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment.
CONCLUSIONS
CONCLUSIONS
We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
Identifiants
pubmed: 37884897
doi: 10.1186/s12859-023-05531-6
pii: 10.1186/s12859-023-05531-6
pmc: PMC10605969
doi:
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
400Subventions
Organisme : Fonds Wetenschappelijk Onderzoek
ID : 1117322N
Organisme : Fonds Wetenschappelijk Onderzoek
ID : 1SE7822N
Informations de copyright
© 2023. The Author(s).
Références
Bioinformatics. 2014 Dec 15;30(24):3476-83
pubmed: 25398610
Nat Genet. 2019 Feb;51(2):354-362
pubmed: 30643257
Nat Genet. 2014 Mar;46(3):279-86
pubmed: 24464101
PLoS Med. 2015 Sep 29;12(9):e1001880
pubmed: 26418737
Lancet. 1993 Mar 13;341(8846):647-50
pubmed: 8095569
Nat Methods. 2012 Mar 04;9(4):357-9
pubmed: 22388286
Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53
pubmed: 11504945
Nat Genet. 2012 Jan 08;44(2):226-32
pubmed: 22231483
Science. 2021 Dec 17;374(6574):abg8871
pubmed: 34914532
Appl Environ Microbiol. 2019 Oct 30;85(22):
pubmed: 31471308
Bioinformatics. 2021 Nov 18;37(22):4048-4055
pubmed: 34117875
Sci Transl Med. 2019 Apr 10;11(487):
pubmed: 30971455
Genome Biol. 2009;10(9):R98
pubmed: 19761611
Brief Bioinform. 2018 Jan 1;19(1):118-135
pubmed: 27769991
Genome Biol. 2020 Sep 24;21(1):253
pubmed: 32972461
Bioinformatics. 2009 Dec 15;25(24):3207-12
pubmed: 19808877
Nat Genet. 2011 Dec 18;44(1):106-10
pubmed: 22179134
Bioinformatics. 2020 Jun 1;36(12):3712-3718
pubmed: 32321164
Nat Biotechnol. 2019 Aug;37(8):907-915
pubmed: 31375807
BMC Bioinformatics. 2020 Jul 24;21(Suppl 12):306
pubmed: 32703258
Genome Biol. 2021 Jan 4;22(1):8
pubmed: 33397413
Genome Biol. 2020 Sep 17;21(1):250
pubmed: 32943086
BMC Bioinformatics. 2016 Jun 16;17(1):237
pubmed: 27306641
Proc Natl Acad Sci U S A. 2005 Sep 27;102(39):13950-5
pubmed: 16172379
Nat Biotechnol. 2018 Oct;36(9):875-879
pubmed: 30125266
Genome Res. 2003 Nov;13(11):2498-504
pubmed: 14597658
PLoS One. 2009 Nov 05;4(11):e7778
pubmed: 19890396
iScience. 2021 Jun 10;24(7):102687
pubmed: 34235407
Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i260-i269
pubmed: 37387143
Bioinformatics. 2017 Oct 15;33(20):3181-3187
pubmed: 28200001
Genome Biol. 2020 Oct 16;21(1):265
pubmed: 33066802
Bioinformatics. 2016 Feb 15;32(4):497-504
pubmed: 26504144
Science. 2007 Nov 9;318(5852):901-2
pubmed: 17991835
Genome Biol. 2020 May 25;21(1):124
pubmed: 32450900
Mol Syst Biol. 2011 Aug 02;7:522
pubmed: 21811232
BMC Bioinformatics. 2018 Sep 4;19(1):311
pubmed: 30180801
G3 (Bethesda). 2015 Mar 17;5(5):931-41
pubmed: 25787242
Antimicrob Agents Chemother. 2013 Feb;57(2):827-32
pubmed: 23208709
Cell. 2017 May 18;169(5):849-861.e13
pubmed: 28502769
Bioinformatics. 2018 Jul 1;34(13):i169-i177
pubmed: 29949982
Bioinformatics. 2016 Nov 1;32(21):3224-3232
pubmed: 27378303
Algorithms Mol Biol. 2016 Jul 18;11:20
pubmed: 27437028
Science. 2000 Mar 24;287(5461):2196-204
pubmed: 10731133