A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions.

Alzheimer’s disease; Epistasis; Feature importances; Glaucoma; Machine learning; Random forest; Simulation

Journal

BioData Mining
ISSN: 1756-0381
Abbreviated title: BioData Min
Country: England
NLM ID: 101319161

Publication information

Publication date:
29 Jan 2021
History:
received: 01 07 2020
accepted: 13 01 2021
entrez: 30 1 2021
pubmed: 31 1 2021
medline: 31 1 2021
Status: epublish

Abstract

Non-additive interactions among genes are frequently associated with a number of phenotypes, including complex diseases such as Alzheimer's disease, diabetes, and cardiovascular disease. Detecting these interactions requires careful selection of analytical methods, because some machine learning algorithms are unable, or underpowered, to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts because it can detect and model non-additive interactions. In addition, Random Forest has a built-in ability to estimate feature importance scores, which allows the model to be interpreted in terms of the rank order and effect size of each feature's association with the outcome. This characteristic is important for epidemiological and clinical studies, where the results of predictive modeling may be used to define the future direction of research. Two alternative ways to interpret the model are the permutation feature importance metric, which uses a permutation approach to calculate a feature contribution coefficient in units of the decrease in the model's performance, and Shapley additive explanations (SHAP), which use a cooperative game theory approach. It is currently unclear which Random Forest feature importance metric provides a superior estimate of the true informative contribution of features in genetic association analysis.
To address this issue, and to improve the interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. We detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.
By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
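
The kind of comparison described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes scikit-learn, and the sample size, number of SNPs, allele frequency, noise rate, and XOR-style interaction model are all hypothetical choices made for illustration. It fits a Random Forest to genotypes in which only SNP 0 and SNP 1 matter, and only jointly, then prints the built-in (impurity-based) importances next to the permutation importances. SHAP values are left out to keep the example dependency-free.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 3000, 8  # hypothetical sizes

# Hypothetical SNP genotypes coded as minor-allele counts (0, 1, 2).
# The allele frequency is chosen so that half of the samples carry
# at least one minor allele at each SNP.
maf = 1.0 - np.sqrt(0.5)
X = rng.binomial(2, maf, size=(n_samples, n_features))

# XOR-like interaction between SNP 0 and SNP 1: the phenotype depends on
# carrier status at both loci jointly, so neither SNP has a marginal effect
# on its own. About 10% of labels are flipped as noise.
carrier = X[:, :2] > 0
y = np.logical_xor(carrier[:, 0], carrier[:, 1]).astype(int)
noise = rng.random(n_samples) < 0.1
y[noise] = 1 - y[noise]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Built-in importance: mean decrease in impurity, computed on training data.
impurity = forest.feature_importances_

# Permutation importance: mean drop in held-out accuracy when one
# feature column is shuffled.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

for j in range(n_features):
    print(
        f"SNP{j}: impurity={impurity[j]:.3f}  "
        f"permutation={perm.importances_mean[j]:.3f}"
    )
```

Ranking the SNPs by each column of the printed table shows whether the two metrics agree on which features are informative. A single toy run like this says nothing about which metric is better in general; that question is what the study's larger simulations address.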

Identifiers

pubmed: 33514397
doi: 10.1186/s13040-021-00243-0
pii: 10.1186/s13040-021-00243-0
pmc: PMC7847145

Publication types

Journal Article

Languages

eng

Pagination

9

Grants

Agency: NIH HHS
ID: LM010098
Country: United States
Agency: NLM NIH HHS
ID: R01 LM011360
Country: United States
Agency: NIH HHS
ID: AI116794
Country: United States
Agency: NLM NIH HHS
ID: R01 LM012601
Country: United States
Agency: NLM NIH HHS
ID: R01 LM010098
Country: United States
Agency: NIAID NIH HHS
ID: R01 AI116794
Country: United States

Authors

Alena Orlenko (A)

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.

Jason H Moore (JH)

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA. jhmoore@upenn.edu.

MeSH classifications