Global, highly specific and fast filtering of alignment seeds.
Alignment
Alignment anchor
Genome alignment
Geometric hashing
Spaced seeds
Journal
BMC bioinformatics
ISSN: 1471-2105
Titre abrégé: BMC Bioinformatics
Pays: England
ID NLM: 100965194
Informations de publication
Date de publication:
10 Jun 2022
10 Jun 2022
Historique:
received:
01
05
2020
accepted:
23
05
2022
entrez:
10
6
2022
pubmed:
11
6
2022
medline:
15
6
2022
Statut:
epublish
Résumé
An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity. An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.
Sections du résumé
BACKGROUND
BACKGROUND
An important initial phase of arguably most homology search and alignment methods such as required for genome alignments is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only.
RESULTS
RESULTS
We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about million-fold over sets of spaced seeds while maintaining a very high sensitivity.
CONCLUSIONS
CONCLUSIONS
An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.
Identifiants
pubmed: 35689182
doi: 10.1186/s12859-022-04745-4
pii: 10.1186/s12859-022-04745-4
pmc: PMC9188137
doi:
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
225Subventions
Organisme : Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung
ID : 407540_167331
Informations de copyright
© 2022. The Author(s).
Références
Bioinformatics. 2008 Aug 15;24(16):1757-64
pubmed: 18567917
Nucleic Acids Res. 2014 Apr;42(7):e59
pubmed: 24493737
Genome Res. 2003 Jan;13(1):103-7
pubmed: 12529312
J Bioinform Comput Biol. 2004 Jan;1(4):595-610
pubmed: 15290755
Bioinformatics. 2007 Nov 15;23(22):2969-77
pubmed: 17804438
Bioinformatics. 2009 Feb 1;25(3):302-8
pubmed: 19095701
BMC Bioinformatics. 2008 Nov 12;9:476
pubmed: 19014477
Genome Res. 1996 Sep;6(9):846-57
pubmed: 8889551
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1:S37
pubmed: 20122210
Proc Natl Acad Sci U S A. 1983 Feb;80(3):726-30
pubmed: 6572363
Database (Oxford). 2011 Jul 23;2011:bar030
pubmed: 21785142
PLoS Comput Biol. 2016 Oct 19;12(10):e1005107
pubmed: 27760124
Bioinformatics. 2004 May 1;20(7):1053-9
pubmed: 14764573
J Comput Biol. 2006 Mar;13(2):296-308
pubmed: 16597241
BMC Bioinformatics. 2004 Oct 14;5:149
pubmed: 15485572
Bioinformatics. 2011 Sep 1;27(17):2433-4
pubmed: 21690104
Bioinformatics. 2006 Jul 15;22(14):e341-9
pubmed: 16873491
Annu Rev Anim Biosci. 2019 Feb 15;7:41-64
pubmed: 30379572
Bioinformatics. 2002 Mar;18(3):440-5
pubmed: 11934743
Genome Res. 2002 Jun;12(6):996-1006
pubmed: 12045153
J Comput Biol. 2005 Jul-Aug;12(6):847-61
pubmed: 16108721
Nat Methods. 2015 Jan;12(1):59-60
pubmed: 25402007
Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W540-3
pubmed: 15980530
Bioinformatics. 2019 Jan 15;35(2):211-218
pubmed: 29992260
Nucleic Acids Res. 2021 Jan 8;49(D1):D1046-D1057
pubmed: 33221922
J Mol Biol. 1990 Oct 5;215(3):403-10
pubmed: 2231712
J Bioinform Comput Biol. 2004 Sep;2(3):417-39
pubmed: 15359419