High-dimensional supervised classification in a context of non-independence of observations to identify the determining SNPs in a phenotype.
Correlated variables
High-dimensional supervised classification
Non independence of observations
Phenotype
SNP
Journal
Infectious Disease Modelling
ISSN: 2468-0427
Abbreviated title: Infect Dis Model
Country: China
NLM ID: 101692406
Publication information
Publication date:
Dec 2023
History:
received: 2023-06-26
revised: 2023-08-29
accepted: 2023-09-03
medline: 2023-09-20
pubmed: 2023-09-20
entrez: 2023-09-20
Status:
epublish
Abstract
This work addresses supervised classification of highly correlated, high-dimensional data describing non-independent observations, with the aim of identifying SNPs related to a phenotype. We use a general penalized linear mixed model with a single random effect that performs SNP selection and population structure adjustment simultaneously in high-dimensional prediction models. Specifically, the model selects variables and estimates their effects at the same time, taking into account correlations between individuals. Single nucleotide polymorphisms (SNPs) are a type of genetic variation; each SNP represents a difference in a single DNA building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct source population of an individual and can act in isolation or jointly to affect a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is of great importance. In this study, we used uncorrelated variables, obtained from blocks of correlated variables constructed in previous work, to describe the most closely related observations in the dataset. The model was trained on 90% of the observations and tested on the remaining 10%. The best model selected with the generalized information criterion (GIC) identified the SNP rs2493311, located on chromosome 1 in the gene PRDM16 (PR/SET domain 16), as the most decisive factor in malaria attacks.
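The workflow described in the abstract (penalized variable selection, a 90%/10% split, and model choice by GIC) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes simulated SNP genotypes, a continuous phenotype, and a plain lasso with no random effect, so it does not adjust for relatedness between individuals as the paper's mixed model does. All names (`lasso_cd`, `gic`, the penalty grid, the causal SNP index) are hypothetical, and the GIC penalty `log(log(n)) * log(p)` is one common high-dimensional choice assumed here, not taken from the paper.

```python
import math
import random

rng = random.Random(0)
n, p, causal = 200, 20, 3  # hypothetical sizes and causal SNP index

# Simulated genotypes coded 0/1/2 and a phenotype driven by a single SNP.
X = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(p)] for _ in range(n)]
y = [1.5 * row[causal] + rng.gauss(0.0, 0.5) for row in X]

# 90% training / 10% test split, as in the abstract.
idx = list(range(n))
rng.shuffle(idx)
cut = (9 * n) // 10
train, test = idx[:cut], idx[cut:]

# Center with training means so that no intercept is needed.
x_mean = [sum(X[i][j] for i in train) / len(train) for j in range(p)]
y_mean = sum(y[i] for i in train) / len(train)
X_tr = [[X[i][j] - x_mean[j] for j in range(p)] for i in train]
y_tr = [y[i] - y_mean for i in train]
X_te = [[X[i][j] - x_mean[j] for j in range(p)] for i in test]
y_te = [y[i] - y_mean for i in test]

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=50):
    """Cyclic coordinate descent for (1/2n) * RSS + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # equals y - X beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (j excluded).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def predict(row, beta):
    return sum(v * b for v, b in zip(row, beta))

def gic(X, y, beta):
    """Gaussian GIC up to a constant: n * log(RSS / n) + a_n * df (a_n assumed)."""
    n, p = len(X), len(X[0])
    rss = sum((yi - predict(row, beta)) ** 2 for row, yi in zip(X, y))
    df = sum(1 for b in beta if b != 0.0)
    a_n = math.log(math.log(n)) * math.log(p)
    return n * math.log(max(rss, 1e-12) / n) + a_n * df

# Fit along a small penalty grid and keep the fit with the lowest GIC.
grid = [1.0, 0.5, 0.25, 0.1, 0.05, 0.01]
fits = []
for lam in grid:
    beta = lasso_cd(X_tr, y_tr, lam)
    fits.append((gic(X_tr, y_tr, beta), lam, beta))
best_gic, best_lam, best_beta = min(fits, key=lambda t: t[0])

selected = [j for j, b in enumerate(best_beta) if b != 0.0]
test_mse = sum((yi - predict(row, best_beta)) ** 2 for row, yi in zip(X_te, y_te)) / len(y_te)
print("lambda:", best_lam, "selected SNP indices:", selected, "test MSE:", test_mse)
```

In a mixed-model setting, the same selection loop would be run after accounting for a kinship-based covariance between individuals; that step is deliberately omitted here to keep the sketch short.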
Identifiers
pubmed: 37727806
doi: 10.1016/j.idm.2023.09.002
pii: S2468-0427(23)00084-2
pmc: PMC10505671
Publication types
Journal Article
Languages
eng
Pagination
1079-1087
Copyright information
© 2023 The Authors.
Conflict of interest statement
Authors declare no conflict of interest.