Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.

Journal

Nature Computational Science

ISSN: 2662-8457

Abbreviated title: Nat Comput Sci

Country: United States

NLM ID: 101775476

Publication information

Publication date:
Feb 2024

History:

received: 12 Jul 2023

accepted: 16 Jan 2024

medline: 28 Feb 2024

pubmed: 28 Feb 2024

entrez: 27 Feb 2024

Status: ppublish

Abstract

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.

Identifiers

DOI: 10.1038/s43588-024-00596-6 PMID: 38413777

pii: 10.1038/s43588-024-00596-6

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

104-109

Grants

Agency: Agence Nationale de la Recherche (French National Research Agency)

ID: ANR-19-CE45-0008

Agency: Agence Nationale de la Recherche (French National Research Agency)

ID: ANR-22-CE45-0007

Agency: Agence Nationale de la Recherche (French National Research Agency)

ID: PIA/ANR16-CONV-0005

Agency: Agence Nationale de la Recherche (French National Research Agency)

ID: ANR-19-P3IA-0001

Agency: EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)

ID: 956229

Agency: EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)

ID: 872539

Copyright information

© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.

References

Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).

doi: 10.1038/s41586-021-04332-2 pubmed: 35082445

Paoli, L. et al. Biosynthetic potential of the global ocean microbiome. Nature 607, 111–118 (2022).

doi: 10.1038/s41586-022-04862-3 pubmed: 35732736 pmcid: 9259500

Katz, K. et al. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 50, D387–D390 (2022).

doi: 10.1093/nar/gkab1053 pubmed: 34850094

Chikhi, R., Holub, J. & Medvedev, P. Data structures to represent a set of k-long DNA sequences. ACM Comput. Surv. 54, 1–22 (2021).

doi: 10.1145/3445967

Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).

doi: 10.1101/gr.260604.119 pubmed: 33328168 pmcid: 7849385

Pierce, N. T., Irber, L., Reiter, T., Brooks, P. & Brown, C. T. Large-scale sequence comparisons with sourmash. F1000Research 8, 1006 (2019).

doi: 10.12688/f1000research.19675.1 pubmed: 31508216 pmcid: 6720031

Darvish, M., Seiler, E., Mehringer, S., Rahn, R. & Reinert, K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics 38, 4100–4108 (2022).

doi: 10.1093/bioinformatics/btac492 pubmed: 35801930 pmcid: 9438961

Karasikov, M. et al. Metagraph: indexing and analysing nucleotide archives at petabase-scale. Preprint at bioRxiv https://doi.org/10.1101/2020.10.01.322164 (2020).

Holley, G. & Melsted, P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 21, 249 (2020).

doi: 10.1186/s13059-020-02135-8 pubmed: 32943081 pmcid: 7499882

Cracco, A. & Tomescu, A. I. Extremely fast construction and querying of compacted and colored de Bruijn graphs with ggcat. Genome Res. 33, 1198–1207 (2023).

pubmed: 37253540 pmcid: 10538363

Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).

doi: 10.1145/362686.362692

Bingmann, T., Bradley, P., Gauger, F. & Iqbal, Z. COBS: a Compact Bit-Sliced Signature index. String Processing and Information Retrieval, SPIRE 2019. In Lecture Notes in Computer Science, Vol. 11811 (Springer, Cham, 2019).

Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. J. Comput. Biol. 25, 755–765 (2018).

doi: 10.1089/cmb.2017.0265 pubmed: 29641248 pmcid: 6067102

Harris, R. S. & Medvedev, P. Improved representation of sequence Bloom trees. Bioinformatics 36, 721–727 (2020).

doi: 10.1093/bioinformatics/btz662 pubmed: 31504157

Srikakulam, S. K., Keller, S., Dabbaghie, F., Bals, R. & Kalinina, O. V. Metaprofi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants. Bioinformatics 39, btad101 (2023).

doi: 10.1093/bioinformatics/btad101 pubmed: 36825843 pmcid: 9994790

The Ocean Read Atlas. OSU Institut Pytheas https://ocean-read-atlas.mio.osupytheas.fr/ (2023).

Sunagawa, S. et al. Tara Oceans: towards global ocean ecosystems biology. Nat. Rev. Microbiol. 18, 428–445 (2020).

doi: 10.1038/s41579-020-0364-5 pubmed: 32398798

Alanko, J. N., Vuohtoniemi, J., Mäklin, T. & Puglisi, S. J. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 39, i260–i269 (2023).

doi: 10.1093/bioinformatics/btad233 pubmed: 37387143 pmcid: 10311346

Mehringer, S. et al. Hierarchical interleaved Bloom filter: enabling ultrafast, approximate sequence queries. Genome Biol. 24, 131 (2023).

doi: 10.1186/s13059-023-02971-4 pubmed: 37259161 pmcid: 10230713

Marchet, C. & Limasset, A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 39, i252–i259 (2023).

doi: 10.1093/bioinformatics/btad225 pubmed: 37387170 pmcid: 10311332

Lemane, T., Medvedev, P., Chikhi, R. & Peterlongo, P. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinform. Adv. 2, vbac029 (2022).

doi: 10.1093/bioadv/vbac029 pubmed: 36699393 pmcid: 9710589

Villar, E. et al. The Ocean Gene Atlas: exploring the biogeography of plankton genes online. Nucleic Acids Res. 46, W289–W295 (2018).

doi: 10.1093/nar/gky376 pubmed: 29788376 pmcid: 6030836

Vernette, C. et al. The Ocean Gene Atlas v2.0: online exploration of the biogeography and phylogeny of plankton genes. Nucleic Acids Res. 50, W516–W526 (2022).

doi: 10.1093/nar/gkac420 pubmed: 35687095 pmcid: 9252727

Acinas, S. G. et al. Deep ocean metagenomes provide insight into the metabolic architecture of bathypelagic microbial communities. Commun. Biol. 4, 604 (2021).

doi: 10.1038/s42003-021-02112-2 pubmed: 34021239 pmcid: 8139981

Robidou, L. & Peterlongo, P. findere: fast and precise approximate membership query. In International Symposium on String Processing and Information Retrieval 151–163 (Springer, 2021).

fio. GitHub https://github.com/axboe/fio (2023).

DOI of the provided ORA server GitLab code. Zenodo https://doi.org/10.5281/zenodo.10462412 (2024).

European Nucleotide Archive. European Bioinformatics Institute https://www.ebi.ac.uk/ena/ (2023).

Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Registry of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875582 (2017).

Guidi, L., Gattuso, J.-P. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about carbonate chemistry in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875567 (2017).

Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Biodiversity context of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.853809 (2015).

Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about pigment concentrations (HPLC) in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875569 (2017).

Ardyna, M. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about mesoscale features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875577 (2017).

Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about nutrients in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875575 (2017).

Guidi, L., Picheral, M. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about sensor data in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875576 (2017).

Alberti, A. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Methodology used in the lab for molecular analyses and links to the Sequence Read Archive of selected samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875581 (2017).

Speich, S. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about the water column features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875579 (2017).

Overview. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/ (2023).

Interfaces. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/interfaces.html (2023).

pierrepeterlongo/kmindex_benchmarks: initial release. Zenodo https://doi.org/10.5281/zenodo.10462379 (2024).

DOI of the kmindex GitHub repository. Zenodo https://doi.org/10.5281/zenodo.10462427 (2024).

Authors

Téo Lemane (T)

Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France. teo.lemane@genoscope.cns.fr.

Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France. teo.lemane@genoscope.cns.fr.

ORCID: 0000-0002-7210-3178

Nolan Lezzoche (N)

Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France.

Julien Lecubin (J)

SIP, OSU PYTHEAS, Marseille, France.

Eric Pelletier (E)

Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France.

Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France.

ORCID: 0000-0003-4228-1712

Magali Lescot (M)

Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France.

Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France.

Rayan Chikhi (R)

Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France.

Pierre Peterlongo (P)

Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France. pierre.peterlongo@inria.fr.

ORCID: 0000-0003-0776-6407

MeSH classifications