Improving the power of gene set enrichment analyses.

Breast Neoplasms / genetics; Cohort Studies; Databases as Topic; Female; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Humans; Phenotype; RNA, Messenger / genetics; Sample Size

Enrichment analysis; Gene set enrichment analysis; Statistical power

Journal

BMC Bioinformatics

ISSN: 1471-2105

Abbreviated title: BMC Bioinformatics

Country: England

NLM ID: 100965194

Publication information

Publication date:
17 May 2019

History:

received: 26 Oct 2018

accepted: 25 Apr 2019

entrez: 19 May 2019

pubmed: 19 May 2019

medline: 21 Jun 2019

Status: epublish

Abstract

BACKGROUND

Set enrichment methods are commonly used to analyze high-dimensional molecular data and gain biological insight into molecular or clinical phenotypes. One important category of analysis methods employs an enrichment score, which is created from ranked univariate correlations between phenotype and each molecular attribute. Estimates of the significance of the associations are determined via a null distribution generated from phenotype permutation. We investigate some statistical properties of this method and demonstrate how alternative assessments of enrichment can be used to increase the statistical power of such analyses to detect associations between phenotype and biological processes and pathways.

RESULTS

For this category of set enrichment analysis, the null distribution is largely independent of the number of samples with available molecular data. Hence, provided the sample cohort is not too small, we show that increased statistical power to identify associations between biological processes and phenotype can be achieved by splitting the cohort into two halves and using the average of the enrichment scores evaluated for each half as an alternative test statistic. Further, we demonstrate that this principle can be extended by averaging over multiple random splits of the cohort into halves. This enables the calculation of an enrichment statistic and associated p-value of arbitrary precision, independent of the exact random splits used.

CONCLUSIONS

It is possible to increase the statistical power of gene set enrichment analyses that employ enrichment scores created from running sums of univariate phenotype-attribute correlations and phenotype-permutation generated null distributions. This increase can be achieved by using alternative test statistics that average enrichment scores calculated for splits of the dataset. Apart from the special case of a close balance between up- and down-regulated genes within a gene set, statistical power can be improved, or at least maintained, by this method down to small sample sizes, where accurate assessment of univariate phenotype-gene correlations becomes infeasible.
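The split-and-average statistic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an unweighted Kolmogorov-Smirnov-style running sum as the enrichment score, Pearson correlation as the univariate measure, a one-sided p-value, and reuse of the same random splits for every permuted phenotype. All function names (`enrichment_score`, `random_half_splits`, `split_average_score`, `permutation_p_value`) are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def enrichment_score(expr, phenotype, gene_set, samples):
    """Signed maximum deviation of an unweighted running sum (assumed form).

    expr maps gene name -> list of values, one per sample; only the sample
    indices in `samples` are used. Assumes the gene set contains at least
    one, but not all, of the genes in expr.
    """
    pheno = [phenotype[i] for i in samples]
    corr = {g: pearson([v[i] for i in samples], pheno) for g, v in expr.items()}
    ranked = sorted(corr, key=corr.get, reverse=True)
    n_hit = sum(g in gene_set for g in ranked)
    n_miss = len(ranked) - n_hit
    running, best = 0.0, 0.0
    for g in ranked:
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def random_half_splits(n_samples, n_splits, rng):
    """Draw n_splits random partitions of the sample indices into halves."""
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        half = n_samples // 2
        splits.append((idx[:half], idx[half:]))
    return splits


def split_average_score(expr, phenotype, gene_set, splits):
    """Average, over splits, of the mean enrichment score of the two halves."""
    return mean(
        (enrichment_score(expr, phenotype, gene_set, a)
         + enrichment_score(expr, phenotype, gene_set, b)) / 2.0
        for a, b in splits
    )


def permutation_p_value(expr, phenotype, gene_set, splits, n_perm, rng):
    """One-sided p-value from a phenotype-permutation null (assumed form)."""
    observed = split_average_score(expr, phenotype, gene_set, splits)
    exceed = 0
    for _ in range(n_perm):
        shuffled = phenotype[:]
        rng.shuffle(shuffled)  # break the phenotype-sample pairing
        if split_average_score(expr, shuffled, gene_set, splits) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    pheno = [rng.gauss(0, 1) for _ in range(40)]
    # Synthetic data: genes g0..g5 track the phenotype, the rest are noise.
    expr = {f"g{j}": [p + rng.gauss(0, 0.1) if j < 6 else rng.gauss(0, 1)
                      for p in pheno] for j in range(30)}
    splits = random_half_splits(len(pheno), 5, rng)
    print(permutation_p_value(expr, pheno, {f"g{j}" for j in range(6)},
                              splits, 50, rng))
```

In this sketch, increasing the number of splits reduces the dependence of the statistic on any particular partition, which is the property the abstract attributes to averaging over multiple random splits.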

Identifiers

DOI: 10.1186/s12859-019-2850-1

PMID: 31101008

PMC: PMC6525372

PII: 10.1186/s12859-019-2850-1

Chemical substances

RNA, Messenger 0

Publication types

Journal Article

Languages

eng

Citation subsets

Pagination

257


Authors

Joanna Roder (J)

Benjamin Linstid (B)

Carlos Oliveira (C)
