Dynamic partitioning of search patterns for approximate pattern matching using search schemes.
Algorithms
Bioinformatics
Computer science
High-performance computing in bioinformatics
Journal
iScience
ISSN: 2589-0042
Titre abrégé: iScience
Pays: United States
ID NLM: 101724038
Informations de publication
Date de publication:
23 Jul 2021
23 Jul 2021
Historique:
received:
19
03
2021
revised:
17
05
2021
accepted:
28
05
2021
entrez:
8
7
2021
pubmed:
9
7
2021
medline:
9
7
2021
Statut:
epublish
Résumé
Search schemes constitute a flexible and generic framework to describe how all approximate occurrences of a search pattern in a text can be found efficiently. We propose an algorithm for the dynamic partitioning of search patterns which can be universally applied to any kind of search scheme and demonstrate that this technique significantly reduces the search space. We present Columba, a software tool written in C++, in which a multitude of search schemes are implemented. We discuss implementation aspects such as memory interleaving of Burrows-Wheeler transform representations and the reduction of redundancy that is inherently associated with the edit distance metric. Ultimately, we demonstrate that Columba has superior performance to the state of the art. Using a single CPU core, Columba is able to retrieve all occurrences of 100,000 Illumina reads and their reverse complements within a maximum edit distance of four in the human genome in less than 3 min.
Identifiants
pubmed: 34235407
doi: 10.1016/j.isci.2021.102687
pii: S2589-0042(21)00655-6
pmc: PMC8246400
doi:
Types de publication
Journal Article
Langues
eng
Pagination
102687Informations de copyright
© 2021 The Author(s).
Déclaration de conflit d'intérêts
The authors declare no competing interests.
Références
Genome Res. 2002 Apr;12(4):656-64
pubmed: 11932250
Bioinformatics. 2009 Jul 15;25(14):1754-60
pubmed: 19451168
J Mol Biol. 1990 Oct 5;215(3):403-10
pubmed: 2231712
Genome Res. 2017 May;27(5):849-864
pubmed: 28396521