A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions.

Alzheimer’s disease; Epistasis; Feature importances; Glaucoma; Machine learning; Random forest; Simulation

Journal

BioData mining

ISSN: 1756-0381

Abbreviated title: BioData Min

Country: England

NLM ID: 101319161

Publication information

Publication date:
29 Jan 2021

History:

received: 1 Jul 2020

accepted: 13 Jan 2021

entrez: 30 Jan 2021

pubmed: 31 Jan 2021

medline: 31 Jan 2021

Status: epublish

Abstract

Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer's, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts due to its ability to detect and model non-additive interactions. In addition, Random Forest has a built-in ability to estimate feature importance scores, which allows the model to be interpreted in terms of the rank order and effect size of each feature's association with the outcome. This characteristic is very important for epidemiological and clinical studies, where the results of predictive modeling could be used to define the future direction of research efforts. Alternative ways to interpret the model are the permutation feature importance metric, which uses a permutation approach to calculate a feature contribution coefficient in units of the decrease in the model's performance, and Shapley additive explanations, which employ a cooperative game theory approach. Currently, it is unclear which Random Forest feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis.

To address this issue, and to improve interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. We detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.
By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
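The contrast between the built-in Random Forest importance and the permutation feature importance described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn as the toolkit, and the toy dataset (two binary features combined by XOR, plus eight noise features) and all variable names are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 2000

# Hypothetical toy data: 10 binary features, of which only features 0 and 1 matter.
X = rng.integers(0, 2, size=(n_samples, 10))
# XOR of features 0 and 1: a purely non-additive interaction in which
# neither feature has a marginal effect on its own.
y = X[:, 0] ^ X[:, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Built-in (impurity-based) importance, computed from the training data.
builtin = forest.feature_importances_

# Permutation importance: mean drop in held-out accuracy when one
# feature's column is shuffled, averaged over n_repeats shuffles.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0, scoring="accuracy"
)

# Feature indices ordered from most to least important under each metric.
builtin_rank = np.argsort(builtin)[::-1]
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("built-in ranking:   ", builtin_rank)
print("permutation ranking:", perm_rank)
```

Comparing `builtin_rank` with `perm_rank` gives a rank-level comparison in the spirit of the study. Shapley additive explanations are left out here because they require an additional package.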

Identifiers

DOI: 10.1186/s13040-021-00243-0 PMID: 33514397 PMC: PMC7847145

pubmed: 33514397

doi: 10.1186/s13040-021-00243-0

pii: 10.1186/s13040-021-00243-0

pmc: PMC7847145

Publication types

Journal Article

Languages

eng

Grants

Agency: NIH HHS

ID: LM010098

Country: United States

Agency: NLM NIH HHS

ID: R01 LM011360

Country: United States

Agency: NIH HHS

ID: AI116794

Country: United States

Agency: NLM NIH HHS

ID: R01 LM012601

Country: United States

Agency: NLM NIH HHS

ID: R01 LM010098

Country: United States

Agency: NIAID NIH HHS

ID: R01 AI116794

Country: United States

Authors

Alena Orlenko (A)

Jason H Moore (JH)
