Grouping of genomic markers in populations with family structure.
Clustering
Covariance matrix
Group lasso
SNP-BLUP
Single nucleotide polymorphism
TagSNP
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
19 Feb 2021
History:
received: 05 Aug 2020
accepted: 08 Feb 2021
entrez: 20 Feb 2021
pubmed: 21 Feb 2021
medline: 10 Mar 2021
Status:
epublish
Abstract
BACKGROUND
Linkage and linkage disequilibrium (LD) between genome regions cause dependencies among genomic markers. Due to family stratification in livestock or crop populations with non-random mating, standard measures of population LD such as [Formula: see text] may be biased. Grouping markers according to their interdependence needs to account for the actual population structure to allow proper inference in genome-based evaluations.
RESULTS
Given a matrix reflecting the strength of association between markers, groups are built successively using a greedy algorithm; the largest groups are built first. Optionally, a representative marker is selected for each group. We provide an implementation of the grouping approach as a new function in the R package hscovar. This package enables the calculation of the theoretical covariance between biallelic markers for half- or full-sib families and the derivation of representative markers. In case studies, we showed that, compared to a population-LD approach, the number of groups comprising dependent markers was smaller and representative SNPs were spread more uniformly over the investigated chromosome region when the family stratification was respected. In a simulation study, we observed that the sensitivity and specificity of a genome-based association study improved if the selection of representative markers took family structure into account.
CONCLUSIONS
Chromosome segments that frequently recombine in the underlying population can be identified from the matrix of pairwise dependence between markers. Representative markers can be exploited, for instance, for dimension reduction prior to a genome-based association study, or the grouping structure itself can be employed in a grouped penalization approach.
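The greedy grouping described in the abstract can be illustrated with a minimal Python sketch. It is not the hscovar implementation, whose details may differ. The function name `greedy_groups`, the absolute-value threshold of 0.8, the toy matrix, and the rule that the marker with the most partners serves as the representative are all assumptions made for illustration.

```python
def greedy_groups(assoc, threshold=0.8):
    """Greedy grouping sketch (hypothetical helper, not part of hscovar).

    assoc     : symmetric matrix (list of lists) of pairwise association
                strengths between markers, e.g. correlations.
    threshold : assumed cut-off; markers i and j count as dependent
                when |assoc[i][j]| >= threshold.
    Returns a list of groups, largest first, each with a representative.
    """
    ungrouped = set(range(len(assoc)))
    groups = []
    while ungrouped:
        order = sorted(ungrouped)
        best_rep, best_members = None, []
        for i in order:
            # Ungrouped markers that are strongly associated with marker i
            members = [j for j in order
                       if j == i or abs(assoc[i][j]) >= threshold]
            # Keep the candidate that would form the largest group
            if len(members) > len(best_members):
                best_rep, best_members = i, members
        groups.append({"representative": best_rep, "members": best_members})
        ungrouped -= set(best_members)
    return groups


# Toy association matrix for five markers (made-up values)
assoc = [
    [1.00, 0.90, 0.85, 0.10, 0.00],
    [0.90, 1.00, 0.50, 0.20, 0.10],
    [0.85, 0.50, 1.00, 0.10, 0.00],
    [0.10, 0.20, 0.10, 1.00, 0.95],
    [0.00, 0.10, 0.00, 0.95, 1.00],
]

groups = greedy_groups(assoc, threshold=0.8)
for g in groups:
    print(g["representative"], g["members"])
```

In this toy case marker 0 is strongly associated with markers 1 and 2, so the first and largest group is {0, 1, 2} with marker 0 as its representative. Markers 3 and 4 then form the second group. Swapping in a matrix that reflects family structure instead of population LD would change only the input, not the grouping logic.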
Identifiers
pubmed: 33607943
doi: 10.1186/s12859-021-04010-0
pii: 10.1186/s12859-021-04010-0
pmc: PMC7893918
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
79
Grants
Agency: Deutsche Forschungsgemeinschaft
ID: WI 4450/2-1