Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.


Journal

Nature computational science
ISSN: 2662-8457
Titre abrégé: Nat Comput Sci
Pays: United States
ID NLM: 101775476

Informations de publication

Date de publication:
Feb 2024
Historique:
received: 12 07 2023
accepted: 16 01 2024
medline: 28 2 2024
pubmed: 28 2 2024
entrez: 27 2 2024
Statut: ppublish

Résumé

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.

Identifiants

pubmed: 38413777
doi: 10.1038/s43588-024-00596-6
pii: 10.1038/s43588-024-00596-6
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

104-109

Subventions

Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : ANR-19-CE45-0008
Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : ANR-19-CE45-0008
Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : ANR-19-CE45-0008
Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : ANR-19-CE45-0008
Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : ANR-22-CE45-0007
Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : PIA/ANR16-CONV-0005
Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : ANR-19-P3IA-0001
Organisme : Agence Nationale de la Recherche (French National Research Agency)
ID : ANR-19-CE45-0008
Organisme : EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
ID : 956229
Organisme : EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
ID : 872539

Informations de copyright

© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.

Références

Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).
doi: 10.1038/s41586-021-04332-2 pubmed: 35082445
Paoli, L. et al. Biosynthetic potential of the global ocean microbiome. Nature 607, 111–118 (2022).
doi: 10.1038/s41586-022-04862-3 pubmed: 35732736 pmcid: 9259500
Katz, K. et al. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 50, D387–D390 (2022).
doi: 10.1093/nar/gkab1053 pubmed: 34850094
Chikhi, R., Holub, J. & Medvedev, P. Data structures to represent a set of k-long DNA sequences. ACM Comput. Surv. 54, 1–22 (2021).
doi: 10.1145/3445967
Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).
doi: 10.1101/gr.260604.119 pubmed: 33328168 pmcid: 7849385
Pierce, N. T., Irber, L., Reiter, T., Brooks, P. & Brown, C. T. Large-scale sequence comparisons with sourmash. F1000Research 8, 1006 (2019).
doi: 10.12688/f1000research.19675.1 pubmed: 31508216 pmcid: 6720031
Darvish, M., Seiler, E., Mehringer, S., Rahn, René & Reinert, K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics 38, 4100–4108 (2022).
doi: 10.1093/bioinformatics/btac492 pubmed: 35801930 pmcid: 9438961
Karasikov, M. et al. Metagraph: indexing and analysing nucleotide archives at petabase-scale. Preprint at bioRxiv https://doi.org/10.1101/2020.10.01.322164 (2020).
Holley, G. & Melsted, P. áll Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 21, 249 (2020).
doi: 10.1186/s13059-020-02135-8 pubmed: 32943081 pmcid: 7499882
Cracco, A. & Tomescu, A. I. Extremely fast construction and querying of compacted and colored de Bruijn graphs with ggcat. Genome Res. 33, 1198–1207 (2023).
pubmed: 37253540 pmcid: 10538363
Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).
doi: 10.1145/362686.362692
Bingmann, T., Bradley, P., Gauger, F. & Iqbal, Z. COBS: a Compact Bit-Sliced Signature index. String Processing and Information Retrieval, SPIRE 2019. In Lecture Notes in Computer Science, Vol. 11811 (Springer, Cham, 2019).
Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. J. Comput. Biol. 25, 755–765 (2018).
doi: 10.1089/cmb.2017.0265 pubmed: 29641248 pmcid: 6067102
Harris, R. S. & Medvedev, P. Improved representation of sequence Bloom trees. Bioinformatics 36, 721–727 (2020).
doi: 10.1093/bioinformatics/btz662 pubmed: 31504157
Srikakulam, S. K., Keller, S., Dabbaghie, F., Bals, R. & Kalinina, O. V. Metaprofi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants. Bioinformatics 39, btad101 (2023).
doi: 10.1093/bioinformatics/btad101 pubmed: 36825843 pmcid: 9994790
The Ocean Read Atlas. OSU Institut Pytheas https://ocean-read-atlas.mio.osupytheas.fr/ (2023).
Sunagawa, S. et al. Tara Oceans: towards global ocean ecosystems biology. Nat. Rev. Microbiol. 18, 428–445 (2020).
doi: 10.1038/s41579-020-0364-5 pubmed: 32398798
Alanko, J. N., Vuohtoniemi, J., Mäklin, T. & Puglisi, S. J. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 39, i260–i269 (2023).
doi: 10.1093/bioinformatics/btad233 pubmed: 37387143 pmcid: 10311346
Mehringer, S. et al. Hierarchical interleaved Bloom filter: enabling ultrafast, approximate sequence queries. Genome Biol. 24, 131 (2023).
doi: 10.1186/s13059-023-02971-4 pubmed: 37259161 pmcid: 10230713
Marchet, C. & Limasset, A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 39, i252–i259 (2023).
doi: 10.1093/bioinformatics/btad225 pubmed: 37387170 pmcid: 10311332
Lemane, T., Medvedev, P., Chikhi, R. & Peterlongo, P. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinform. Adv. 2, vbac029 (2022).
doi: 10.1093/bioadv/vbac029 pubmed: 36699393 pmcid: 9710589
Villar, E. et al. The Ocean Gene Atlas: exploring the biogeography of plankton genes online. Nucleic Acids Res. 46, W289–W295 (2018).
doi: 10.1093/nar/gky376 pubmed: 29788376 pmcid: 6030836
Vernette, C. et al. The Ocean Gene Atlas v2. 0: online exploration of the biogeography and phylogeny of plankton genes. Nucleic Acids Res. 50, W516–W526 (2022).
doi: 10.1093/nar/gkac420 pubmed: 35687095 pmcid: 9252727
Acinas, S. G. et al. Deep ocean metagenomes provide insight into the metabolic architecture of bathypelagic microbial communities. Commun. Biol. 4, 604 (2021).
doi: 10.1038/s42003-021-02112-2 pubmed: 34021239 pmcid: 8139981
Robidou, L. & Peterlongo, P. findere: fast and precise approximate membership query. In International Symposium on String Processing and Information Retrieval 151–163 (Springer, 2021).
fio. GitHub https://github.com/axboe/fio (2023).
DOI of the provided ORA server GitLab code. Zenodo https://doi.org/10.5281/zenodo.10462412 (2024).
European Nucleotide Archive. European Bioinformatics Institute https://www.ebi.ac.uk/ena/ (2023).
Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Registry of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875582 (2017).
Guidi, L., Gattuso, J.-P. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about carbonate chemistry in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875567 (2017).
Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Biodiversity context of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.853809 (2015).
Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about pigment concentrations (HPLC) in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875569 (2017).
Ardyna, M. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about mesoscale features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875577 (2017).
Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about nutrients in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875575 (2017).
Guidi, L., Picheral, M. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about sensor data in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875576 (2017).
Alberti, A. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Methodology used in the lab for molecular analyses and links to the Sequence Read Archive of selected samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875581 (2017).
Speich, S. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about the water column features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875579 (2017).
Overview. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/ (2023).
Interfaces. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/interfaces.html (2023).
pierrepeterlongo/kmindex_benchmarks: initial release. Zenodo https://doi.org/10.5281/zenodo.10462379 (2024).
DOI of the kmindex GitHub repository. Zenodo https://doi.org/10.5281/zenodo.10462427 (2024).

Auteurs

Téo Lemane (T)

Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France. teo.lemane@genoscope.cns.fr.
Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France. teo.lemane@genoscope.cns.fr.

Nolan Lezzoche (N)

Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France.

Julien Lecubin (J)

SIP, OSU PYTHEAS, Marseille, France.

Eric Pelletier (E)

Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France.
Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France.

Magali Lescot (M)

Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France.
Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France.

Rayan Chikhi (R)

Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France.

Pierre Peterlongo (P)

Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France. pierre.peterlongo@inria.fr.

Classifications MeSH