binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
28 Aug 2020
History:
received: 26 Mar 2020
accepted: 19 Aug 2020
entrez: 30 Aug 2020
pubmed: 30 Aug 2020
medline: 30 Sep 2020
Status: epublish

Abstract sections

BACKGROUND
In this era of data science-driven bioinformatics, machine learning research has focused on feature selection because users want more interpretable results and post-hoc analyses for biomarker detection. However, when a study has more features (i.e., transcripts) than samples (i.e., mice or human samples), biomarker detection poses major statistical challenges because traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatorial challenge. In computational biology, random forest (RF) classifiers are widely used because of their flexibility, strong performance, ability to rank features, and robustness to the "P >> N" high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
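The combinatorial challenge mentioned above can be made concrete by counting candidate feature sets. The following is a minimal sketch, not part of the article: the function name `interaction_counts` and the feature counts used in the loop are illustrative assumptions.

```python
from math import comb


def interaction_counts(num_features: int, max_order: int = 3) -> dict[int, int]:
    """Count candidate feature sets of each order from 1 up to max_order.

    Order 1 is the main effects, order 2 the pairwise interactions, and so on.
    """
    return {k: comb(num_features, k) for k in range(1, max_order + 1)}


if __name__ == "__main__":
    # Hypothetical feature counts, chosen only to show how quickly the space grows.
    for p in (100, 1_000, 20_000):
        print(p, interaction_counts(p))
```

With 100 features there are already 4,950 pairs and 161,700 triples, so exhaustively testing interactions becomes expensive long before the feature count reaches transcriptome scale.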
RESULTS
In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers' main effects and interactions. In two clinical studies, the binomialRF algorithm prioritized previously published, relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as using their statistical interactions alone.
CONCLUSIONS
binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers' main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
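The abstract does not spell out the test statistic, so the sketch below only illustrates the general idea of a binomial hypothesis test on a forest. It assumes that the statistic is the number of trees in which a feature is chosen as the root split, that under the null every feature is equally likely to be the root (p0 = 1/P), and that trees are independent. It therefore uses a plain binomial tail and does not reproduce the correlated binomial adjustment the article describes. All function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log, log1p


def binomial_upper_tail(successes: int, trials: int, p0: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p0), with 0 < p0 < 1."""
    total = 0.0
    for k in range(successes, trials + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_pmf = (
            lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
            + k * log(p0) + (trials - k) * log1p(-p0)
        )
        total += exp(log_pmf)
    return min(1.0, total)


def root_split_pvalues(root_features: list[int], num_features: int) -> dict[int, float]:
    """One-sided p-value per feature from the root-split feature of each tree.

    ASSUMPTION: uniform null probability p0 = 1 / num_features and independent
    trees; the article's correlated binomial model is not implemented here.
    """
    trials = len(root_features)
    p0 = 1.0 / num_features
    counts = Counter(root_features)
    return {
        j: binomial_upper_tail(counts.get(j, 0), trials, p0)
        for j in range(num_features)
    }


if __name__ == "__main__":
    # Toy forest of 100 trees over 10 features; feature 0 is the root in 40 trees.
    roots = [0] * 40 + [1 + (i % 9) for i in range(60)]
    for feature, p in root_split_pvalues(roots, num_features=10).items():
        print(feature, p)
```

In this toy input only feature 0 is selected as the root far more often than the uniform null predicts, so it is the only feature with a small p-value. A real analysis would also need a multiple-testing correction across features, which is omitted here.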

Identifiers

pubmed: 32859146
doi: 10.1186/s12859-020-03718-9
pii: 10.1186/s12859-020-03718-9
pmc: PMC7456085

Chemical substances

Biomarkers 0
Biomarkers, Tumor 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

374

Grants

Agency: NIAID NIH HHS
ID: U01 AI122275
Country: United States
Agency: NCI NIH HHS
ID: P30CA023074
Country: United States
Agency: NIH HHS
ID: 1UG3OD023171
Country: United States

Comments and corrections

Type: ErratumIn

Authors

Samir Rachid Zaim (S)

Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA.
The Graduate Interdisciplinary Program in Statistics, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA.
College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA.

Colleen Kenost (C)

Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA.
College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA.

Joanne Berghout (J)

Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA.
College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA.

Wesley Chiu (W)

Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA.
College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA.

Liam Wilson (L)

Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA.
College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA.

Hao Helen Zhang (HH)

Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA. hzhang@math.arizona.edu.
The Graduate Interdisciplinary Program in Statistics, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA. hzhang@math.arizona.edu.
Department of Mathematics, College of Sciences, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA. hzhang@math.arizona.edu.

Yves A Lussier (YA)

Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA. Lussier.Y@gmail.com.
The Graduate Interdisciplinary Program in Statistics, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA. Lussier.Y@gmail.com.
College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA. Lussier.Y@gmail.com.
The Center for Applied Genetic and Genomic Medicine, 1295 N. Martin, Tucson, AZ, 85721, USA. Lussier.Y@gmail.com.
The University of Arizona Cancer Center, 3838 N. Campbell Ave, Tucson, AZ, 85721, USA. Lussier.Y@gmail.com.
The University of Arizona BIO5 Institute, 1657 E. Helen Street, Tucson, AZ, 85721, USA. Lussier.Y@gmail.com.
