What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics.


Journal

Human genetics
ISSN: 1432-1203
Abbreviated title: Hum Genet
Country: Germany
NLM ID: 7613873

Publication information

Publication date:
Sep 2022
History:
received: 13 05 2021
accepted: 08 11 2021
pubmed: 5 12 2021
medline: 11 8 2022
entrez: 4 12 2021
Status: ppublish

Abstract

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods also evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. Feature importance allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. It further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
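
As a rough illustration of what a feature importance metric measures, the sketch below uses permutation importance: shuffle one feature at a time in held-out data and record how much prediction accuracy drops. This is one common approach and is not necessarily the metric used by any method covered in the review. The toy data set (three SNP genotypes coded 0/1/2, with only the first one determining the phenotype), the small k nearest neighbors classifier, and all function names are assumptions made for the example.

```python
import random


def knn_predict(train_X, train_y, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0


def accuracy(train_X, train_y, test_X, test_y, k=5):
    """Fraction of test subjects whose phenotype is predicted correctly."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


def permutation_importance(train_X, train_y, test_X, test_y, rng, k=5):
    """Return baseline accuracy and the accuracy drop per shuffled feature."""
    baseline = accuracy(train_X, train_y, test_X, test_y, k)
    importances = []
    for j in range(len(test_X[0])):
        # Shuffle column j across test subjects to break its link to the phenotype.
        column = [row[j] for row in test_X]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(test_X, column)]
        importances.append(
            baseline - accuracy(train_X, train_y, permuted, test_y, k)
        )
    return baseline, importances


# Hypothetical toy data: 300 subjects, 3 SNPs coded 0/1/2.
# By construction, only SNP 0 influences the binary phenotype.
rng = random.Random(0)
X = [[rng.randint(0, 2) for _ in range(3)] for _ in range(300)]
y = [1 if row[0] >= 1 else 0 for row in X]
train_X, train_y = X[:200], y[:200]
test_X, test_y = X[200:], y[200:]

baseline, importances = permutation_importance(train_X, train_y, test_X, test_y, rng)
print("baseline accuracy:", baseline)
for j, drop in enumerate(importances):
    print(f"SNP {j}: accuracy drop when shuffled = {drop:.3f}")
```

In this constructed example, shuffling SNP 0 should reduce accuracy far more than shuffling the other two SNPs, which is the sense in which a feature is "responsible for a good prediction". Real genetic data would call for an established library and a proper validation scheme.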

Identifiers

pubmed: 34862561
doi: 10.1007/s00439-021-02402-z
pii: 10.1007/s00439-021-02402-z
pmc: PMC9360120

Publication types

Journal Article Review

Languages

eng

Citation subsets

IM

Pagination

1515-1528

Grants

Agency: NHGRI NIH HHS
ID: Intramural Research Program
Country: United States

Copyright information

© 2021. This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.

Authors

Anthony M Musolf (AM)

Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA.

Emily R Holzinger (ER)

Target Sciences, Informatics and Predictive Sciences, Bristol Myers Squibb, Cambridge, MA, USA.

James D Malley (JD)

Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA.

Joan E Bailey-Wilson (JE)

Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA. jebw@mail.nih.gov.
