The case for using mapped exonic non-duplicate reads when reporting RNA-sequencing depth: examples from pediatric cancer datasets.
Keywords
RNA-Seq
depth
duplicate
exonic
quality
sequencing
unmapped
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date: 13 03 2021
History:
received: 27 08 2020
revised: 27 12 2020
accepted: 07 02 2021
entrez: 13 03 2021
pubmed: 14 03 2021
medline: 17 11 2021
Status: ppublish
Abstract
The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) is dependent on the sequencing depth. While unmapped or non-exonic reads do not contribute to gene expression quantification, duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis. In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1-77% of all reads (median [IQR], 3% [3-6%]); duplicate reads constitute 3-100% of mapped reads (median [IQR], 27% [13-43%]); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (median [IQR], 25% [16-37%]). MEND reads constitute 0-79% of total reads (median [IQR], 50% [30-61%]). Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
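The nested fractions reported in the abstract can be illustrated with a short Python sketch. It is not the authors' script: the class and function names are hypothetical, and it assumes the per-sample read counts have already been obtained (in the authors' workflow these would come from tools such as RSeQC, sambamba, and samblaster). Each fraction uses the denominator stated in the abstract: unmapped reads over all reads, duplicates over mapped reads, and non-exonic reads over mapped, non-duplicate reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadCounts:
    """Hypothetical per-sample read counts (field names are illustrative)."""
    total: int          # all sequenced reads
    mapped: int         # reads aligned to the reference
    mapped_nondup: int  # mapped reads remaining after duplicate removal
    mend: int           # mapped, exonic, non-duplicate (MEND) reads

    def __post_init__(self):
        # Each category is a subset of the previous one, so counts must be nested.
        if not (0 <= self.mend <= self.mapped_nondup <= self.mapped <= self.total):
            raise ValueError("expected mend <= mapped_nondup <= mapped <= total")


def fraction(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(counts):
    """Compute the four fractions using the denominators given in the abstract."""
    return {
        "unmapped_of_total": fraction(counts.total - counts.mapped, counts.total),
        "duplicate_of_mapped": fraction(counts.mapped - counts.mapped_nondup, counts.mapped),
        "nonexonic_of_mapped_nondup": fraction(
            counts.mapped_nondup - counts.mend, counts.mapped_nondup
        ),
        "mend_of_total": fraction(counts.mend, counts.total),
    }


def mend_fraction_from_rates(unmapped, duplicate, nonexonic):
    """MEND fraction of total reads implied by the three nested loss rates."""
    return (1 - unmapped) * (1 - duplicate) * (1 - nonexonic)


# Invented example sample, chosen so the loss rates equal the reported medians
# (3% unmapped, 27% duplicate, 25% non-exonic). These are not data from the study.
example = ReadCounts(
    total=100_000_000,
    mapped=97_000_000,
    mapped_nondup=70_810_000,
    mend=53_107_500,
)
summary = summarize(example)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For a single sample with loss rates equal to the reported medians, the MEND fraction works out to about 53% of total reads. The reported median MEND fraction is 50%; the two need not match, because the product of per-category medians is generally not the median of per-sample products.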
Identifiers
pubmed: 33712853
pii: 6169410
doi: 10.1093/gigascience/giab011
pmc: PMC7955155
Chemical substances
RNA
63231-63-0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Grants
Agency: NHGRI NIH HHS
ID: T32 HG008345
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG010971
Country: United States
Agency: Howard Hughes Medical Institute
Country: United States
Copyright information
© The Author(s) 2021. Published by Oxford University Press GigaScience.