The archives are half-empty: an assessment of the availability of microbial community sequencing data.
Journal
Communications biology
ISSN: 2399-3642
Abbreviated title: Commun Biol
Country: England
NLM ID: 101719179
Publication information
Publication date:
28 08 2020
History:
received: 22 06 2020
accepted: 03 08 2020
entrez: 30 8 2020
pubmed: 30 8 2020
medline: 23 6 2021
Status:
epublish
Abstract
As DNA sequencing has become more popular, the public genetic repositories where sequences are archived have experienced explosive growth. These repositories now hold invaluable collections of sequences, e.g., for microbial ecology, but whether these data are reusable has not been evaluated. We assessed the availability and state of 16S rRNA gene amplicon sequences archived in public genetic repositories (SRA, EBI, and DDBJ). We screened 26,927 publications in 17 microbiology journals, identifying 2,015 16S rRNA gene sequencing studies. Of these, 7.2% had not made their data public at the time of analysis. Among a subset of 635 studies sequencing the same gene region, 40.3% contained data that were not available or not reusable, and an additional 25.5% contained faults in data formatting or data labeling, creating obstacles for data reuse. Our study reveals gaps in data availability, identifies major contributors to data loss, and offers suggestions for improving data archiving practices.
Identifiers
pubmed: 32859925
doi: 10.1038/s42003-020-01204-9
pii: 10.1038/s42003-020-01204-9
pmc: PMC7455719
Chemical substances
RNA, Ribosomal, 16S
0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
474