Detecting selection in low-coverage high-throughput sequencing data using principal component analysis.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
29 Sep 2021
History:
received: 04 03 2021
accepted: 10 09 2021
entrez: 30 9 2021
pubmed: 1 10 2021
medline: 2 10 2021
Status: epublish

Abstract

Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to a need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data. We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. Here, we present two selection statistics, which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening up the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high quality genotype data. Moreover, we demonstrate that PCAngsd outperforms selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.
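
The abstract does not spell out the two statistics, so the following is only a minimal sketch of the general idea, not the PCAngsd implementation. It assumes genotype likelihoods are given as an array of shape (sites, individuals, 3), converts them to expected dosages under a Hardy-Weinberg prior, and then computes one common form of principal component based selection statistic (squared SNP loadings scaled by the number of sites, compared against a chi-square distribution with 1 degree of freedom). The function names `expected_dosages` and `pc_selection_scan` are hypothetical.

```python
import math
import numpy as np


def _posterior_dosage(gl, f):
    """Posterior expected dosage per site and individual under a
    Hardy-Weinberg prior with per-site allele frequency f."""
    prior = np.stack([(1 - f) ** 2, 2 * f * (1 - f), f ** 2], axis=1)
    post = gl * prior[:, None, :]              # shape (M, N, 3)
    post = post / post.sum(axis=2, keepdims=True)
    return post[..., 1] + 2 * post[..., 2]     # E[genotype], shape (M, N)


def expected_dosages(gl, n_iter=50):
    """gl: (M sites, N individuals, 3) likelihoods for 0, 1, 2 allele copies.
    Allele frequencies are estimated with a simple EM (an assumption of this
    sketch), then used as the prior for the final dosages."""
    f = np.full(gl.shape[0], 0.25)
    for _ in range(n_iter):
        dosage = _posterior_dosage(gl, f)
        f = np.clip(dosage.mean(axis=1) / 2, 1e-4, 1 - 1e-4)
    return _posterior_dosage(gl, f), f


def pc_selection_scan(dosage, f, n_pcs=1):
    """Return (statistics, p-values), each of shape (M, n_pcs)."""
    m = dosage.shape[0]
    # Standardize each site by its expected mean and standard deviation.
    y = (dosage - 2 * f[:, None]) / np.sqrt(2 * f * (1 - f))[:, None]
    # Left singular vectors hold the per-site loadings on each PC.
    u, _, _ = np.linalg.svd(y, full_matrices=False)
    stats = m * u[:, :n_pcs] ** 2
    # Chi-square (1 d.f.) survival function: P(Z^2 > x) = erfc(sqrt(x / 2)).
    pvals = np.vectorize(math.erfc)(np.sqrt(stats / 2))
    return stats, pvals
```

Because each left singular vector has unit norm, the statistic averages exactly 1 over sites for every component, which matches the mean of the chi-square reference distribution. Sites with unusually large values on a component are candidates for selection along that axis of population structure.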

Identifiers

pubmed: 34587903
doi: 10.1186/s12859-021-04375-2
pii: 10.1186/s12859-021-04375-2
pmc: PMC8480091

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

470

Grants

Agency: Novo Nordisk Fonden
ID: NNF20OC0061343
Agency: Det Frie Forskningsråd
ID: DFF 0135-00211B

Copyright information

© 2021. The Author(s).

Authors

Jonas Meisner (J)

Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark.

Anders Albrechtsen (A)

Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark.

Kristian Hanghøj (K)

Department of Biology, The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark. k.hanghoej@bio.ku.dk.
