PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S ribosomal RNA, ITS, and COI marker genes.
Docker
HPC
container
eDNA
high performance computing
metabarcoding
pipeline
singularity
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date: 2020-03-01
History:
received: 2019-11-18
revised: 2020-01-05
accepted: 2020-02-14
entrez: 2020-03-13
pubmed: 2020-03-13
medline: 2021-01-28
Status: ppublish
Abstract
BACKGROUND
Environmental DNA and metabarcoding allow the identification of a mixture of species and launch a new era in bio- and eco-assessment. Many steps are required to obtain taxonomically assigned matrices from raw data. For most of these, a plethora of tools are available; each tool's execution parameters need to be tailored to reflect each experiment's idiosyncrasy. Adding to this complexity, the computational capacity of high-performance computing systems is frequently required for such analyses. To address these difficulties, bioinformatic pipelines need to combine state-of-the-art technologies and algorithms with an easy-to-get-set-use framework, allowing researchers to tune each study. Software containerization technologies ease the sharing and running of software packages across operating systems; thus, they strongly facilitate pipeline development and usage. Likewise, programming languages specialized for big data pipelines incorporate features like roll-back checkpoints and on-demand partial pipeline execution.
FINDINGS
PEMA is a containerized assembly of key metabarcoding analysis tools that requires little effort to set up, run, and customize to researchers' needs. Based on third-party tools, PEMA performs read pre-processing, (molecular) operational taxonomic unit clustering, amplicon sequence variant inference, and taxonomy assignment for 16S and 18S ribosomal RNA, as well as ITS and COI marker gene data. Owing to its simplified parameterization and checkpoint support, PEMA allows users to explore alternative algorithms for specific steps of the pipeline without the need for a complete re-execution. PEMA was evaluated against both mock communities and previously published datasets and achieved results of comparable quality.
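The checkpoint idea mentioned above can be illustrated with a minimal, generic Python sketch. This is not PEMA's code or interface: the function names (`run_pipeline`, `trim_reads`, `cluster_exact`, `cluster_by_prefix`), the toy reads, and the pickle-file checkpoint layout are all hypothetical and chosen only for illustration. The sketch shows how swapping the algorithm used for one step can reuse the saved result of the unchanged step before it.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def run_pipeline(steps, data, checkpoint_dir, log):
    """Run steps in order; reuse a saved result when a step and everything upstream is unchanged."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    # Start the fingerprint from the input so different inputs never share checkpoints.
    upstream = hashlib.sha256(pickle.dumps(data)).hexdigest()
    for name, func, params in steps:
        # The key changes if this step's function, its parameters, or any upstream step changes.
        payload = json.dumps([upstream, name, func.__name__, params], sort_keys=True)
        upstream = hashlib.sha256(payload.encode()).hexdigest()
        checkpoint = checkpoint_dir / f"{name}-{upstream[:16]}.pkl"
        if checkpoint.exists():
            data = pickle.loads(checkpoint.read_bytes())
            log.append(f"reused {name}")
        else:
            data = func(data, **params)
            checkpoint.write_bytes(pickle.dumps(data))
            log.append(f"ran {name}")
    return data


# Hypothetical toy steps standing in for real pre-processing and clustering tools.
def trim_reads(reads, min_length):
    return [r for r in reads if len(r) >= min_length]


def cluster_exact(reads):
    counts = {}
    for r in reads:
        counts[r] = counts.get(r, 0) + 1
    return counts


def cluster_by_prefix(reads, prefix_length):
    counts = {}
    for r in reads:
        key = r[:prefix_length]
        counts[key] = counts.get(key, 0) + 1
    return counts


reads = ["ACGTAC", "ACGTAC", "ACGTTT", "AC"]
log = []
workdir = tempfile.mkdtemp()

# First run: both steps execute and are checkpointed.
first = run_pipeline(
    [("trim", trim_reads, {"min_length": 4}), ("cluster", cluster_exact, {})],
    reads, workdir, log,
)

# Second run: only the clustering algorithm changes, so trimming is reused.
second = run_pipeline(
    [("trim", trim_reads, {"min_length": 4}),
     ("cluster", cluster_by_prefix, {"prefix_length": 4})],
    reads, workdir, log,
)
```

In this sketch, a checkpoint key is derived from the input, the step's function and parameters, and all preceding steps, so changing an early step would also invalidate every later checkpoint.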
CONCLUSIONS
A high-performance computing-based approach was used to develop PEMA; however, it can be used on personal computers as well. PEMA's time-efficient performance and good results will allow it to be used for accurate environmental DNA metabarcoding analysis, thus enhancing the applicability of next-generation biodiversity assessment studies.
Identifiers
pubmed: 32161947
pii: 5803335
doi: 10.1093/gigascience/giaa022
pmc: PMC7066391
Chemical substances
DNA, Environmental (0)
RNA, Ribosomal, 16S (0)
RNA, Ribosomal, 18S (0)
Electron Transport Complex IV (EC 1.9.3.1)
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Comments and corrections
Type: ErratumIn
Copyright information
© The Author(s) 2020. Published by Oxford University Press.