The case for using mapped exonic non-duplicate reads when reporting RNA-sequencing depth: examples from pediatric cancer datasets.


Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
13 03 2021
History:
received: 27 08 2020
revised: 27 12 2020
accepted: 07 02 2021
entrez: 13 3 2021
pubmed: 14 3 2021
medline: 17 11 2021
Status: ppublish

Abstract

The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) is dependent on the sequencing depth. While unmapped or non-exonic reads do not contribute to gene expression quantification, duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis. In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1-77% of all reads (median [IQR], 3% [3-6%]); duplicate reads constitute 3-100% of mapped reads (median [IQR], 27% [13-43%]); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (median [IQR], 25% [16-37%]). MEND reads constitute 0-79% of total reads (median [IQR], 50% [30-61%]). Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
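The four percentages in the abstract use different denominators: unmapped reads are a share of all reads, duplicates a share of mapped reads, non-exonic reads a share of mapped non-duplicate reads, and MEND reads a share of total reads. The sketch below makes that bookkeeping explicit. It is a minimal illustration only: the `ReadCounts` container, the `read_fractions` function, and the example counts are hypothetical, and it is not the custom script shipped in the authors' Docker image.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ReadCounts:
    """Read counts for one sample (hypothetical container, not from the paper)."""
    total: int                 # all sequenced reads
    mapped: int                # reads that aligned to the reference
    mapped_non_duplicate: int  # mapped reads remaining after duplicate removal
    mend: int                  # mapped, exonic, non-duplicate reads


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def read_fractions(counts: ReadCounts) -> Dict[str, Optional[float]]:
    """Compute each fraction with the denominator used in the abstract."""
    # Each category is a subset of the previous one, so counts must not increase.
    if not (counts.total >= counts.mapped
            >= counts.mapped_non_duplicate >= counts.mend >= 0):
        raise ValueError("expected total >= mapped >= non-duplicate >= MEND >= 0")
    return {
        "unmapped_of_total": _ratio(counts.total - counts.mapped, counts.total),
        "duplicate_of_mapped": _ratio(
            counts.mapped - counts.mapped_non_duplicate, counts.mapped),
        "non_exonic_of_mapped_non_duplicate": _ratio(
            counts.mapped_non_duplicate - counts.mend,
            counts.mapped_non_duplicate),
        "mend_of_total": _ratio(counts.mend, counts.total),
    }


# Made-up counts for illustration; they are not taken from any cohort in the study.
example = ReadCounts(total=100_000_000, mapped=90_000_000,
                     mapped_non_duplicate=60_000_000, mend=45_000_000)
for name, value in read_fractions(example).items():
    print(f"{name}: {value:.1%}")
```

Returning `None` for a zero denominator is a design choice made here so that extreme samples (for example, one in which every mapped read is a duplicate) do not raise a division error.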

Identifiers

pubmed: 33712853
pii: 6169410
doi: 10.1093/gigascience/giab011
pmc: PMC7955155

Chemical substances

RNA 63231-63-0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Grants

Agency: NHGRI NIH HHS
ID: T32 HG008345
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG010971
Country: United States
Agency: Howard Hughes Medical Institute
Country: United States

Copyright information

© The Author(s) 2021. Published by Oxford University Press GigaScience.

Authors

Holly C Beale (HC)

UC Santa Cruz, Molecular, Cell and Developmental Biology, 1156 High Street, Santa Cruz, CA 95064, USA.
UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Jacquelyn M Roger (JM)

UC Santa Cruz, School of Engineering, 1156 High Street, Santa Cruz, CA 95064, USA.

Matthew A Cattle (MA)

UC Santa Cruz, School of Engineering, 1156 High Street, Santa Cruz, CA 95064, USA.

Liam T McKay (LT)

UC Santa Cruz, School of Engineering, 1156 High Street, Santa Cruz, CA 95064, USA.

Drew K A Thompson (DKA)

UC Santa Cruz, School of Engineering, 1156 High Street, Santa Cruz, CA 95064, USA.

Katrina Learned (K)

UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

A Geoffrey Lyle (AG)

UC Santa Cruz, Molecular, Cell and Developmental Biology, 1156 High Street, Santa Cruz, CA 95064, USA.
UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Ellen T Kephart (ET)

UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Rob Currie (R)

UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Du Linh Lam (DL)

UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Lauren Sanders (L)

UC Santa Cruz, Molecular, Cell and Developmental Biology, 1156 High Street, Santa Cruz, CA 95064, USA.

Jacob Pfeil (J)

UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

John Vivian (J)

UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Isabel Bjork (I)

UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Sofie R Salama (SR)

UC Santa Cruz, Department of Biomolecular Engineering, 1156 High Street, Santa Cruz, CA 95064, USA.
Howard Hughes Medical Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

David Haussler (D)

UC Santa Cruz, Department of Biomolecular Engineering, 1156 High Street, Santa Cruz, CA 95064, USA.
Howard Hughes Medical Institute, 1156 High Street, Santa Cruz, CA 95064, USA.

Olena M Vaske (OM)

UC Santa Cruz, Molecular, Cell and Developmental Biology, 1156 High Street, Santa Cruz, CA 95064, USA.
UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.
