GEMmaker: process massive RNA-seq datasets on heterogeneous computational infrastructure.

Keywords: Differential gene expression; Gene co-expression network; Gene expression matrix; Nextflow; RNA-seq; Workflows

Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
02 May 2022
History:
received: 03 Sep 2021
accepted: 07 Mar 2022
entrez: 03 May 2022
pubmed: 04 May 2022
medline: 06 May 2022
Status: epublish

Abstract

Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analyses such as differential gene expression analysis and gene co-expression network construction. Individual RNA-seq experiments are growing larger, and combining multiple experiments from sequence repositories can yield datasets with thousands of samples. Processing hundreds to thousands of RNA-seq samples raises challenges in data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures. GEMmaker is an nf-core-compliant Nextflow workflow that quantifies gene expression from small to massive RNA-seq datasets. GEMmaker makes results highly reproducible through the use of versioned, containerized software that can be executed on a single workstation, an institutional compute cluster, a Kubernetes platform, or the cloud. GEMmaker supports popular alignment and quantification tools and provides results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of locally or remotely stored samples without exceeding available data storage. Workflows that quantify gene expression are not new, and many already address portability, reusability, and scaling in terms of access to CPUs. GEMmaker provides these benefits and adds the ability to scale when data storage is limited, which allows users to process hundreds to thousands of RNA-seq samples on storage-constrained infrastructure. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions.
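
The abstract does not say how GEMmaker keeps storage use bounded, so the following Python sketch is only an illustration of the general idea, not GEMmaker's implementation: samples are handled in small batches, only the compact per-sample results are kept, and large intermediate files are deleted before the next batch starts. The names `process_in_batches`, `fake_quantify`, and `demo` are hypothetical and exist only for this example.

```python
import shutil
import tempfile
from pathlib import Path


def process_in_batches(sample_ids, batch_size, work_root, quantify):
    """Process samples one batch at a time and delete intermediates after each batch.

    `quantify(sample_id, workdir)` is a hypothetical callable that stages a
    sample's data inside `workdir` and returns a small result (for example a
    dict mapping gene -> count). Only those small results are retained.
    Returns the results and the largest number of work directories that
    existed at the same time.
    """
    results = {}
    peak_dirs = 0
    work_root = Path(work_root)
    for start in range(0, len(sample_ids), batch_size):
        batch = sample_ids[start:start + batch_size]
        workdirs = []
        for sid in batch:
            workdir = work_root / sid
            workdir.mkdir(parents=True)
            workdirs.append(workdir)
            results[sid] = quantify(sid, workdir)
        peak_dirs = max(peak_dirs, len(workdirs))
        # Free storage before the next batch is staged.
        for workdir in workdirs:
            shutil.rmtree(workdir)
    return results, peak_dirs


def fake_quantify(sample_id, workdir):
    """Stand-in for download + alignment + quantification (assumption)."""
    (workdir / "reads.fastq").write_text("@r1\nACGT\n+\nIIII\n")
    return {"geneA": len(sample_id)}


def demo():
    """Run seven fake samples in batches of three inside a temporary directory."""
    with tempfile.TemporaryDirectory() as root:
        samples = [f"S{i}" for i in range(1, 8)]
        results, peak = process_in_batches(samples, 3, root, fake_quantify)
        leftover = list(Path(root).iterdir())
    return results, peak, leftover
```

With this design, peak disk use depends on the batch size rather than on the total number of samples, which is the property the abstract highlights.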

Identifiers

pubmed: 35501696
doi: 10.1186/s12859-022-04629-7
pii: 10.1186/s12859-022-04629-7
pmc: PMC9063052

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

156

Grants

Agency: National Science Foundation
ID: 1659300
Agency: Washington Tree Fruit Research Commission
ID: AP-19-103
Agency: Washington State University
ID: Emerging Research Initiatives
Agency: Washington State University
ID: Livestock Health
Agency: Washington State University
ID: Food Security program award
Agency: U.S. Department of Agriculture
ID: 1014919
Agency: McIntyre Stennis
ID: WNP00009

Copyright information

© 2022. The Author(s).

Authors

John A Hadish (JA)

Molecular Plant Sciences Program, Washington State University, Pullman, WA, USA.

Tyler D Biggs (TD)

Department of Horticulture, Washington State University, Pullman, WA, USA.

Benjamin T Shealy (BT)

Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA.

M Reed Bender (MR)

Biomedical Data Science and Informatics, Clemson University, Clemson, SC, USA.

Coleman B McKnight (CB)

Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.

Connor Wytko (C)

Department of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA.

Melissa C Smith (MC)

Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA.

F Alex Feltus (FA)

Biomedical Data Science and Informatics, Clemson University, Clemson, SC, USA.
Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.
Center for Human Genetics, Clemson University, Greenwood, SC, USA.

Loren Honaas (L)

USDA Agricultural Research Service, Wenatchee, WA, USA.

Stephen P Ficklin (SP)

Molecular Plant Sciences Program, Washington State University, Pullman, WA, USA. stephen.ficklin@wsu.edu.
Department of Horticulture, Washington State University, Pullman, WA, USA. stephen.ficklin@wsu.edu.
