Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias.


Journal

PLoS biology
ISSN: 1545-7885
Abbreviated title: PLoS Biol
Country: United States
NLM ID: 101183755

Publication information

Publication date:
11 2019
History:
received: 30 06 2019
accepted: 08 10 2019
entrez: 13 11 2019
pubmed: 13 11 2019
medline: 24 3 2020
Status: epublish

Abstract

Data normalization is a critical step in RNA sequencing (RNA-seq) analysis, aiming to remove systematic effects from the data to ensure that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. This stochastic sample-specific effect is not corrected by common normalization methods, including reads per kilobase of transcript length per million reads (RPKM), Trimmed Mean of M values (TMM), relative log expression (RLE), and quantile and upper-quartile normalization. Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent functional misinterpretation of the data. Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix genes) are particularly prone to such false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. Consequently, using these normalization methods led to substantial reduction in GSEA false results while retaining true ones. In addition, we found that application of gene-set tests that take into account gene-gene correlations attenuates false positive rates caused by the length bias, but statistical power is reduced as well. Our results advocate the inspection and correction of sample-specific length biases as default steps in RNA-seq analysis pipelines and reiterate the need to account for intergene correlations when performing gene-set enrichment tests to lessen false interpretation of transcriptomic data.
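The association described in the abstract, between gene length and fold-change estimates for a pair of samples, can be inspected with a simple diagnostic. The following is a minimal Python sketch on simulated data, not the authors' procedure and not an implementation of cqn or EDASeq. The function names (`pearson`, `length_fold_change_correlation`, `simulate_pair`), the pseudocount, and the simulation parameters are all assumptions introduced for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_fold_change_correlation(lengths, counts_a, counts_b, pseudocount=1.0):
    """Correlate log2 gene length with the log2 fold change (B over A).

    Counts are scaled by library size only, so a clearly nonzero result
    suggests a length effect that differs between the two samples.
    """
    total_a = sum(counts_a)
    total_b = sum(counts_b)
    log_len = [math.log2(length) for length in lengths]
    log_fc = [
        math.log2((b + pseudocount) / total_b)
        - math.log2((a + pseudocount) / total_a)
        for a, b in zip(counts_a, counts_b)
    ]
    return pearson(log_len, log_fc)


def simulate_pair(n_genes=2000, length_exponent=0.0, seed=0):
    """Simulate two samples (hypothetical setup).

    Sample B receives an extra (length / 1000) ** length_exponent factor,
    which mimics a sample-specific length effect when the exponent is nonzero.
    """
    rng = random.Random(seed)
    lengths, counts_a, counts_b = [], [], []
    for _ in range(n_genes):
        length = int(2 ** rng.uniform(8, 16))   # roughly 256 bp to 65 kb
        rate = 2 ** rng.uniform(5, 10)          # expression per kb, arbitrary units
        expected = rate * length / 1000
        noise_a = 2 ** rng.gauss(0, 0.3)        # independent multiplicative noise
        noise_b = 2 ** rng.gauss(0, 0.3)
        bias_b = (length / 1000) ** length_exponent
        lengths.append(length)
        counts_a.append(round(expected * noise_a))
        counts_b.append(round(expected * noise_b * bias_b))
    return lengths, counts_a, counts_b
```

Calling `length_fold_change_correlation(*simulate_pair(length_exponent=0.3))` should give a strongly positive correlation, while `length_exponent=0.0` should give a value near zero. On real data, a marked correlation for a given pair of samples would be a prompt to consider a normalization that takes gene length as a sample-specific covariate, as the abstract recommends.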

Identifiers

pubmed: 31714939
doi: 10.1371/journal.pbio.3000481
pii: PBIOLOGY-D-19-01873
pmc: PMC6850523

Chemical substances

RNA 63231-63-0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

e3000481

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

Shir Mandelboum (S)

School of Molecular Cell Biology and Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel.
Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel.

Zohar Manber (Z)

Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel.

Orna Elroy-Stein (O)

School of Molecular Cell Biology and Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel.
Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel.

Ran Elkon (R)

Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel.
Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel.
