Dynamic partitioning of search patterns for approximate pattern matching using search schemes.

Algorithms Bioinformatics Computer science High-performance computing in bioinformatics

Journal

iScience
ISSN: 2589-0042
Titre abrégé: iScience
Pays: United States
ID NLM: 101724038

Informations de publication

Date de publication:
23 Jul 2021
Historique:
received: 19 03 2021
revised: 17 05 2021
accepted: 28 05 2021
entrez: 8 7 2021
pubmed: 9 7 2021
medline: 9 7 2021
Statut: epublish

Résumé

Search schemes constitute a flexible and generic framework to describe how all approximate occurrences of a search pattern in a text can be found efficiently. We propose an algorithm for the dynamic partitioning of search patterns which can be universally applied to any kind of search scheme and demonstrate that this technique significantly reduces the search space. We present Columba, a software tool written in C++, in which a multitude of search schemes are implemented. We discuss implementation aspects such as memory interleaving of Burrows-Wheeler transform representations and the reduction of redundancy that is inherently associated with the edit distance metric. Ultimately, we demonstrate that Columba has superior performance to the state of the art. Using a single CPU core, Columba is able to retrieve all occurrences of 100,000 Illumina reads and their reverse complements within a maximum edit distance of four in the human genome in less than 3 min.

Identifiants

pubmed: 34235407
doi: 10.1016/j.isci.2021.102687
pii: S2589-0042(21)00655-6
pmc: PMC8246400
doi:

Types de publication

Journal Article

Langues

eng

Pagination

102687

Informations de copyright

© 2021 The Author(s).

Déclaration de conflit d'intérêts

The authors declare no competing interests.

Références

Genome Res. 2002 Apr;12(4):656-64
pubmed: 11932250
Bioinformatics. 2009 Jul 15;25(14):1754-60
pubmed: 19451168
J Mol Biol. 1990 Oct 5;215(3):403-10
pubmed: 2231712
Genome Res. 2017 May;27(5):849-864
pubmed: 28396521

Auteurs

Luca Renders (L)

Department of Information Technology, Ghent University - imec, Ghent 9052, Belgium.

Kathleen Marchal (K)

Department of Information Technology, Ghent University - imec, Ghent 9052, Belgium.
Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9052, Belgium.

Jan Fostier (J)

Department of Information Technology, Ghent University - imec, Ghent 9052, Belgium.

Classifications MeSH