Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.
Journal
Nature Computational Science
ISSN: 2662-8457
Abbreviated title: Nat Comput Sci
Country: United States
NLM ID: 101775476
Publication information
Publication date:
Feb 2024
History:
received: 12 July 2023
accepted: 16 January 2024
medline: 28 February 2024
pubmed: 28 February 2024
entrez: 27 February 2024
Status:
ppublish
Abstract
Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.
Identifiers
pubmed: 38413777
doi: 10.1038/s43588-024-00596-6
pii: 10.1038/s43588-024-00596-6
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
104-109
Grants
Agency: Agence Nationale de la Recherche (French National Research Agency)
ID: ANR-19-CE45-0008
Agency: Agence Nationale de la Recherche (French National Research Agency)
ID: ANR-22-CE45-0007
Agency: Agence Nationale de la Recherche (French National Research Agency)
ID: PIA/ANR16-CONV-0005
Agency: Agence Nationale de la Recherche (French National Research Agency)
ID: ANR-19-P3IA-0001
Agency: EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
ID: 956229
Agency: EC | EU Framework Programme for Research and Innovation H2020 | H2020 Priority Excellent Science | H2020 Marie Skłodowska-Curie Actions (H2020 Excellent Science - Marie Skłodowska-Curie Actions)
ID: 872539
Copyright information
© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.