GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.


Journal

Nucleic acids research
ISSN: 1362-4962
Titre abrégé: Nucleic Acids Res
Pays: England
ID NLM: 0411011

Informations de publication

Date de publication:
16 Jul 2024
Historique:
accepted: 27 06 2024
revised: 20 06 2024
received: 15 11 2023
medline: 16 7 2024
pubmed: 16 7 2024
entrez: 16 7 2024
Statut: aheadofprint

Résumé

Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.

Identifiants

pubmed: 39011878
pii: 7714450
doi: 10.1093/nar/gkae609
pii:
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Subventions

Organisme : National Science Foundation
ID : 1759831

Informations de copyright

© The Author(s) 2024. Published by Oxford University Press on behalf of Nucleic Acids Research.

Auteurs

Jianshu Zhao (J)

Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA.
School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.

Jean Pierre Both (JP)

Université Paris-Saclay, CEA, List, Palaiseau, France.

Luis M Rodriguez-R (LM)

School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
Department of Microbiology, University of Innsbruck, Innsbruck, Austria.
Digital Science Center (DiSC), University of Innsbruck, Innsbruck, Austria.

Konstantinos T Konstantinidis (KT)

Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA.
School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.
School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA.

Classifications MeSH