Detecting gene breakpoints in noisy genome sequences using position-annotated colored de-Bruijn graphs.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
05 Jun 2023
History:
received: 20 03 2023
accepted: 30 05 2023
medline: 7 6 2023
pubmed: 6 6 2023
entrez: 5 6 2023
Status: epublish

Abstract

BACKGROUND
Identifying the locations of gene breakpoints between species of different taxonomic groups can provide useful insights into the underlying evolutionary processes. Given the exact locations of their genes, the breakpoints can be computed without much effort. However, existing gene annotations are often erroneous, or only nucleotide sequences are available. Especially in mitochondrial genomes, high variation in gene order is usually accompanied by a high degree of sequence inconsistency. This makes accurately locating breakpoints in mitogenomic nucleotide sequences a challenging task.
RESULTS
This contribution presents a novel method for detecting gene breakpoints in the nucleotide sequences of complete mitochondrial genomes, taking into account possibly high substitution rates. The method is implemented in the software package DeBBI. DeBBI allows transposition- and inversion-based breakpoints to be analyzed independently and uses a parallel program design that can take advantage of modern multi-processor systems. Extensive tests on synthetic data sets, covering a broad range of sequence dissimilarities and different numbers of introduced breakpoints, demonstrate DeBBI's ability to produce accurate results. Case studies using species of various taxonomic groups further show DeBBI's applicability to real-life data. While (some) multiple sequence alignment tools can also be used for the task at hand, we demonstrate that gene breaks between short, poorly conserved tRNA genes in particular can be detected more frequently with the proposed approach.
CONCLUSIONS
The proposed method constructs a position-annotated de-Bruijn graph of the input sequences. Using a heuristic algorithm, this graph is searched for particular structures, called bulges, which may be associated with the breakpoint locations. Despite the large size of these structures, the algorithm only requires a small number of graph traversal steps.
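
To make the idea of a position-annotated, colored de-Bruijn graph more concrete, the following is a minimal Python sketch. It is not DeBBI's implementation and does not perform the bulge search described in the abstract; the function names `build_annotated_dbg` and `position_offsets`, the dictionary-based graph layout, and the use of exact (error-free) k-mers are all assumptions made for illustration. Each k-mer node records, per input sequence (its "color"), the positions at which it occurs. Shared k-mers whose position offset between two sequences changes hint at a rearranged block.

```python
from collections import defaultdict


def build_annotated_dbg(sequences, k):
    """Build a colored de Bruijn graph with position annotations.

    sequences: dict mapping a sequence id (the "color") to a string.
    Returns (nodes, edges) where
      nodes[kmer][color] is the list of start positions of kmer in that sequence,
      edges[kmer] is the set of k-mers that directly follow kmer in some sequence.
    Hypothetical layout chosen for this sketch only.
    """
    nodes = defaultdict(lambda: defaultdict(list))
    edges = defaultdict(set)
    for color, seq in sequences.items():
        prev = None
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            nodes[kmer][color].append(i)   # position annotation
            if prev is not None:
                edges[prev].add(kmer)      # consecutive k-mers overlap by k-1
            prev = kmer
    return nodes, edges


def position_offsets(nodes, color_a, color_b):
    """For k-mers occurring exactly once in both sequences, return the
    difference between their positions (position in b minus position in a)."""
    offsets = {}
    for kmer, by_color in nodes.items():
        pos_a = by_color.get(color_a, [])
        pos_b = by_color.get(color_b, [])
        if len(pos_a) == 1 and len(pos_b) == 1:
            offsets[kmer] = pos_b[0] - pos_a[0]
    return offsets
```

For two toy sequences in which two blocks have swapped places, the k-mers inside each block share one offset, and the offset changes between blocks. A real method additionally has to cope with substitutions, repeats, and reverse-complemented segments, which this sketch ignores.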


Identifiers

pubmed: 37277700
doi: 10.1186/s12859-023-05371-4
pii: 10.1186/s12859-023-05371-4
pmc: PMC10243065

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

235

Grants

Agency: Deutsche Forschungsgemeinschaft
ID: 21210538
Agency: Universität Leipzig
ID: 21210538

Copyright information

© 2023. The Author(s).

Authors

Lisa Fiedler (L)

Department of Computer Science, University Leipzig, Augustusplatz 10-11, 04109, Leipzig, Germany. lfiedler@informatik.uni-leipzig.de.

Matthias Bernt (M)

Helmholtz Centre for Environmental Research - UFZ, Permoserstraße 15, 04318, Leipzig, Germany.

Martin Middendorf (M)

Department of Computer Science, University Leipzig, Augustusplatz 10-11, 04109, Leipzig, Germany.

Peter F Stadler (PF)

Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, 04107, Leipzig, Germany.
Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04109, Leipzig, Germany.
Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, 1090, Vienna, Austria.
Facultad de Ciencias, Universidad National de Colombia, Sede Bogotá, Ciudad Universitaria, 111321, Bogotá, D.C., Colombia.
Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM, 87501, USA.
