The case for using mapped exonic non-duplicate reads when reporting RNA-sequencing depth: examples from pediatric cancer datasets.

Child; Gene Expression Profiling; High-Throughput Nucleotide Sequencing; Humans; Neoplasms / genetics; RNA; Reproducibility of Results; Sequence Analysis, RNA; Exome Sequencing

RNA-Seq; depth; duplicate; exonic; quality; sequencing; unmapped

Journal

GigaScience

ISSN: 2047-217X

Abbreviated title: Gigascience

Country: United States

NLM ID: 101596872

Publication information

Publication date:
13 03 2021

History:

received: 27 08 2020

revised: 27 12 2020

accepted: 07 02 2021

entrez: 13 3 2021

pubmed: 14 3 2021

medline: 17 11 2021

Status: ppublish

Abstract

BACKGROUND

The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) depends on sequencing depth. Unmapped and non-exonic reads do not contribute to gene expression quantification; duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis.

FINDINGS

In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1-77% of all reads (median [IQR], 3% [3-6%]); duplicate reads constitute 3-100% of mapped reads (median [IQR], 27% [13-43%]); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (median [IQR], 25% [16-37%]). MEND reads constitute 0-79% of total reads (median [IQR], 50% [30-61%]).
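Each percentage above has a different denominator: unmapped reads are a share of all reads, duplicates a share of mapped reads, and non-exonic reads a share of mapped, non-duplicate reads. The following sketch shows how those nested fractions combine into a MEND fraction for a single sample. It is an illustration only, not the authors' script; the function name `mend_fraction` and its parameter names are hypothetical.

```python
def mend_fraction(unmapped_of_total: float,
                  duplicate_of_mapped: float,
                  nonexonic_of_mapped_nondup: float) -> float:
    """Return MEND reads as a fraction of total reads for one sample.

    Each argument is a fraction in [0, 1] measured against the reads that
    survived the previous filter (hypothetical parameter names).
    """
    fractions = (unmapped_of_total, duplicate_of_mapped,
                 nonexonic_of_mapped_nondup)
    if any(not 0.0 <= f <= 1.0 for f in fractions):
        raise ValueError("each fraction must be between 0 and 1")

    mapped = 1.0 - unmapped_of_total                       # share of total reads
    nondup = 1.0 - duplicate_of_mapped                     # share of mapped reads
    exonic = 1.0 - nonexonic_of_mapped_nondup              # share of mapped, non-duplicate reads
    return mapped * nondup * exonic


# One hypothetical sample whose values equal the cohort medians quoted above.
example = mend_fraction(0.03, 0.27, 0.25)
print(f"{example:.1%}")  # prints "53.1%"
```

The result for this hypothetical sample (about 53%) need not match the reported median MEND fraction of 50%, because medians of per-sample fractions do not multiply across samples.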

CONCLUSIONS

Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
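To make the proposed unit concrete, here is a minimal sketch of tallying sequencing depth at each filtering stage. It does not reproduce the authors' Docker image or custom script, and it does not call RSeQC, sambamba, or samblaster. It assumes each read has already been summarized as three boolean flags; the `Read` and `DepthReport` types and the `tally_depth` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    """Hypothetical per-read summary; not the output format of any real tool."""
    mapped: bool
    duplicate: bool
    exonic: bool


@dataclass
class DepthReport:
    total: int = 0          # all reads
    mapped: int = 0         # mapped reads
    mapped_nondup: int = 0  # mapped, non-duplicate reads
    mend: int = 0           # mapped, exonic, non-duplicate reads


def tally_depth(reads: Iterable[Read]) -> DepthReport:
    """Count reads surviving each successive filter."""
    report = DepthReport()
    for read in reads:
        report.total += 1
        if not read.mapped:
            continue
        report.mapped += 1
        if read.duplicate:
            continue
        report.mapped_nondup += 1
        if read.exonic:
            report.mend += 1
    return report


# Four hypothetical reads: only the first passes all three filters.
reads = [
    Read(mapped=True, duplicate=False, exonic=True),
    Read(mapped=True, duplicate=True, exonic=True),
    Read(mapped=True, duplicate=False, exonic=False),
    Read(mapped=False, duplicate=False, exonic=False),
]
report = tally_depth(reads)
print(report)
```

Under this scheme, two datasets with the same total read count can differ widely in `mend`, which is the quantity the recommendation above asks to be reported as depth.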

Identifiers

DOI: 10.1093/gigascience/giab011

PMID: 33712853

PMC: PMC7955155

pii: 6169410

Chemical substances

RNA 63231-63-0

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Grants

Agency: NHGRI NIH HHS

ID: T32 HG008345

Country: United States

Agency: NHGRI NIH HHS

ID: U01 HG010971

Country: United States

Agency: Howard Hughes Medical Institute

Country: United States

Authors

Holly C Beale (HC)

Jacquelyn M Roger (JM)

Matthew A Cattle (MA)

Liam T McKay (LT)

Drew K A Thompson (DKA)

Katrina Learned (K)

A Geoffrey Lyle (AG)

Ellen T Kephart (ET)

Rob Currie (R)

Du Linh Lam (DL)

Lauren Sanders (L)

Jacob Pfeil (J)

John Vivian (J)

Isabel Bjork (I)

Sofie R Salama (SR)

David Haussler (D)

Olena M Vaske (OM)
