GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.
Journal
Nucleic acids research
ISSN: 1362-4962
Titre abrégé: Nucleic Acids Res
Pays: England
ID NLM: 0411011
Informations de publication
Date de publication:
16 Jul 2024
16 Jul 2024
Historique:
accepted:
27
06
2024
revised:
20
06
2024
received:
15
11
2023
medline:
16
7
2024
pubmed:
16
7
2024
entrez:
16
7
2024
Statut:
aheadofprint
Résumé
Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
Identifiants
pubmed: 39011878
pii: 7714450
doi: 10.1093/nar/gkae609
pii:
doi:
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Subventions
Organisme : National Science Foundation
ID : 1759831
Informations de copyright
© The Author(s) 2024. Published by Oxford University Press on behalf of Nucleic Acids Research.