Global, highly specific and fast filtering of alignment seeds.

Alignment Alignment anchor Genome alignment Geometric hashing Spaced seeds

Journal

BMC bioinformatics
ISSN: 1471-2105
Titre abrégé: BMC Bioinformatics
Pays: England
ID NLM: 100965194

Informations de publication

Date de publication:
10 Jun 2022
Historique:
received: 01 05 2020
accepted: 23 05 2022
entrez: 10 6 2022
pubmed: 11 6 2022
medline: 15 6 2022
Statut: epublish

Résumé

An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity. An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.

Sections du résumé

BACKGROUND BACKGROUND
An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only.
RESULTS RESULTS
We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity.
CONCLUSIONS CONCLUSIONS
An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.

Identifiants

pubmed: 35689182
doi: 10.1186/s12859-022-04745-4
pii: 10.1186/s12859-022-04745-4
pmc: PMC9188137
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

225

Subventions

Organisme : Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung
ID : 407540_167331

Informations de copyright

© 2022. The Author(s).

Références

Bioinformatics. 2008 Aug 15;24(16):1757-64
pubmed: 18567917
Nucleic Acids Res. 2014 Apr;42(7):e59
pubmed: 24493737
Genome Res. 2003 Jan;13(1):103-7
pubmed: 12529312
J Bioinform Comput Biol. 2004 Jan;1(4):595-610
pubmed: 15290755
Bioinformatics. 2007 Nov 15;23(22):2969-77
pubmed: 17804438
Bioinformatics. 2009 Feb 1;25(3):302-8
pubmed: 19095701
BMC Bioinformatics. 2008 Nov 12;9:476
pubmed: 19014477
Genome Res. 1996 Sep;6(9):846-57
pubmed: 8889551
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1:S37
pubmed: 20122210
Proc Natl Acad Sci U S A. 1983 Feb;80(3):726-30
pubmed: 6572363
Database (Oxford). 2011 Jul 23;2011:bar030
pubmed: 21785142
PLoS Comput Biol. 2016 Oct 19;12(10):e1005107
pubmed: 27760124
Bioinformatics. 2004 May 1;20(7):1053-9
pubmed: 14764573
J Comput Biol. 2006 Mar;13(2):296-308
pubmed: 16597241
BMC Bioinformatics. 2004 Oct 14;5:149
pubmed: 15485572
Bioinformatics. 2011 Sep 1;27(17):2433-4
pubmed: 21690104
Bioinformatics. 2006 Jul 15;22(14):e341-9
pubmed: 16873491
Annu Rev Anim Biosci. 2019 Feb 15;7:41-64
pubmed: 30379572
Bioinformatics. 2002 Mar;18(3):440-5
pubmed: 11934743
Genome Res. 2002 Jun;12(6):996-1006
pubmed: 12045153
J Comput Biol. 2005 Jul-Aug;12(6):847-61
pubmed: 16108721
Nat Methods. 2015 Jan;12(1):59-60
pubmed: 25402007
Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W540-3
pubmed: 15980530
Bioinformatics. 2019 Jan 15;35(2):211-218
pubmed: 29992260
Nucleic Acids Res. 2021 Jan 8;49(D1):D1046-D1057
pubmed: 33221922
J Mol Biol. 1990 Oct 5;215(3):403-10
pubmed: 2231712
J Bioinform Comput Biol. 2004 Sep;2(3):417-39
pubmed: 15359419

Auteurs

Matthis Ebel (M)

Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489, Greifswald, Germany.
Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489, Greifswald, Germany.

Giovanna Migliorelli (G)

Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489, Greifswald, Germany.
Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489, Greifswald, Germany.

Mario Stanke (M)

Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489, Greifswald, Germany. mario.stanke@uni-greifswald.de.
Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489, Greifswald, Germany. mario.stanke@uni-greifswald.de.

Articles similaires

Robotic Surgical Procedures Animals Humans Telemedicine Models, Animal

Odour generalisation and detection dog training.

Lyn Caldicott, Thomas W Pike, Helen E Zulch et al.
1.00
Animals Odorants Dogs Generalization, Psychological Smell

Selecting optimal software code descriptors-The case of Java.

Yegor Bugayenko, Zamira Kholmatova, Artem Kruglov et al.
1.00
Software Algorithms Programming Languages
Animals TOR Serine-Threonine Kinases Colorectal Neoplasms Colitis Mice

Classifications MeSH