Detecting selection in low-coverage high-throughput sequencing data using principal component analysis.
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
29 Sep 2021
History:
received: 4 Mar 2021
accepted: 10 Sep 2021
entrez: 30 Sep 2021
pubmed: 1 Oct 2021
medline: 2 Oct 2021
Status:
epublish
Abstract
BACKGROUND: Identifying selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger samples of populations with similar ancestries have become increasingly common. This has created a need for methods that can identify signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently difficult to group into meaningful units, which is why existing methods rely on principal component analysis to infer the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data.
METHODS: We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure.
RESULTS: Here, we present two selection statistics which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, making it possible to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high-quality called genotypes.
CONCLUSIONS: We show that selection scans of low-coverage sequencing data from populations with similar ancestry perform on par with scans of high-quality genotype data. Moreover, we demonstrate that PCAngsd outperforms selection statistics obtained from genotypes called from low-coverage sequencing data, without the need for ad hoc filtering.
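The abstract does not spell out the statistics themselves. As a rough illustration of what a principal component based selection scan can look like, the sketch below standardizes a matrix of expected genotypes (dosages), extracts the top principal components by SVD, and scores each site by its squared projection onto each component, compared against a chi-square distribution with one degree of freedom. This is a minimal sketch under stated assumptions, not the PCAngsd implementation: the function name `pc_selection_scan`, the use of plain dosages in place of genotype likelihoods, and the Hardy-Weinberg standardization are all choices made here for illustration.

```python
import math
import numpy as np

def pc_selection_scan(dosages, n_pcs=2):
    """Illustrative PC-based scan (hypothetical helper, not PCAngsd's API).

    dosages: (n_sites, n_individuals) array of expected genotypes in [0, 2].
    Returns (keep, stats, pvals); stats and pvals have one row per kept site
    and one column per principal component.
    """
    freqs = dosages.mean(axis=1) / 2.0
    keep = (freqs > 0.0) & (freqs < 1.0)  # drop monomorphic sites
    d = dosages[keep]
    f = freqs[keep, None]
    # Assumption: standardize each site by its Hardy-Weinberg variance.
    standardized = (d - 2.0 * f) / np.sqrt(2.0 * f * (1.0 - f))
    # Rows of vt are unit-norm axes over individuals (the PCs).
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    pcs = vt[:n_pcs]
    # Squared projection of each site onto each PC.
    stats = (standardized @ pcs.T) ** 2
    # Chi-square (1 df) survival function: P(Z^2 > x) = erfc(sqrt(x / 2)).
    chi2_sf = np.vectorize(lambda x: math.erfc(math.sqrt(x / 2.0)), otypes=[float])
    pvals = chi2_sf(stats)
    return keep, stats, pvals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated neutral data: 200 sites, 30 individuals (illustrative only).
    sim = rng.binomial(2, 0.3, size=(200, 30)).astype(float)
    keep, stats, pvals = pc_selection_scan(sim, n_pcs=2)
    print(stats.shape, float(pvals.min()))
```

Because no individual is assigned to a discrete population, a scan of this form suits samples with continuous structure. In a genotype likelihood setting, the dosage matrix would be replaced by genotype estimates that account for sequencing uncertainty.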
Identifiers
pubmed: 34587903
doi: 10.1186/s12859-021-04375-2
pii: 10.1186/s12859-021-04375-2
pmc: PMC8480091
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
470
Grants
Agency: Novo Nordisk Fonden
ID: NNF20OC0061343
Agency: Det Frie Forskningsråd
ID: DFF 0135-00211B
Copyright information
© 2021. The Author(s).