Detecting selection in low-coverage high-throughput sequencing data using principal component analysis.

Genetics, Population; Genome; Genotype; High-Throughput Nucleotide Sequencing; Humans; Polymorphism, Single Nucleotide; Principal Component Analysis

Journal

BMC bioinformatics

ISSN: 1471-2105

Abbreviated title: BMC Bioinformatics

Country: England

NLM ID: 100965194

Publication information

Publication date:
29 Sep 2021

History:

received: 04 03 2021

accepted: 10 09 2021

entrez: 30 9 2021

pubmed: 1 10 2021

medline: 2 10 2021

Status: epublish

Abstract

BACKGROUND: Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to a need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data.

METHODS: We have extended two principal component analysis-based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure.

RESULTS: Here, we present two selection statistics which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high-quality called genotypes.

CONCLUSIONS: We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high-quality genotype data. Moreover, we demonstrate that PCAngsd outperforms selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.
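To make the idea of a principal component based selection scan on genotype likelihoods more concrete, here is a minimal sketch in Python with NumPy. It is not the PCAngsd implementation. It assumes one common form of PC-based statistic: each site's standardized genotypes are projected onto a unit-norm principal component, and the squared projection is compared to a chi-square distribution with 1 degree of freedom. Genotype uncertainty is handled in a simplified way, by replacing called genotypes with posterior expected genotypes under a Hardy-Weinberg prior. The function names `expected_genotypes` and `pc_selection_scan`, the array layouts, and the prior are all assumptions made for illustration.

```python
import math
import numpy as np


def expected_genotypes(gl, freq):
    """Posterior expected genotype (dosage) per individual and site.

    gl   : array (n_ind, n_sites, 3) of likelihoods P(reads | G = 0, 1, 2)
    freq : array (n_sites,) of allele frequencies for a Hardy-Weinberg prior
           (assumption: a single population-wide frequency per site)
    """
    prior = np.stack(
        [(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2], axis=-1
    )
    post = gl * prior                               # unnormalized posterior
    post = post / post.sum(axis=-1, keepdims=True)  # normalize over genotypes
    return post @ np.array([0.0, 1.0, 2.0])         # E[G | reads]


def pc_selection_scan(dosage, n_pcs=2):
    """Squared projection of each site onto the top principal components.

    dosage : array (n_ind, n_sites) of expected genotypes; every site is
             assumed to be polymorphic (frequency strictly between 0 and 1)
    Returns (d2, pvals), both of shape (n_sites, n_pcs).
    """
    freq = dosage.mean(axis=0) / 2.0
    # Center and scale each site by its binomial standard deviation.
    x = (dosage - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    # Columns of u are unit-norm principal axes across individuals.
    u, _, _ = np.linalg.svd(x, full_matrices=False)
    d2 = (x.T @ u[:, :n_pcs]) ** 2
    # Chi-square (1 df) survival function: P(X > d2) = erfc(sqrt(d2 / 2)).
    pvals = np.vectorize(math.erfc)(np.sqrt(d2 / 2.0))
    return d2, pvals
```

A scan would call `expected_genotypes` on the likelihood array and pass the result to `pc_selection_scan`. Two simplifications are worth keeping in mind: expected genotypes have lower variance than true genotypes at low depth, so this naive scaling is only approximate, and a population-wide Hardy-Weinberg prior ignores the continuous structure that the abstract describes as the main difficulty.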

Identifiers

DOI: 10.1186/s12859-021-04375-2 PMID: 34587903 PMC: PMC8480091

Publication types

Journal Article

Languages

eng

Pagination

470

Grants

Agency: Novo Nordisk Fonden

ID: NNF20OC0061343

Agency: Det Frie Forskningsråd

ID: DFF 0135-00211B

Authors

Jonas Meisner

Anders Albrechtsen

Kristian Hanghøj
