Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data.

genotype likelihoods; high-throughput sequencing; maximum likelihood; next-generation sequencing; population genetics; site frequency spectrum; threading

Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
2022-05-17
History:
received: 2021-10-21
revised: 2021-12-16
entrez: 2022-05-17
pubmed: 2022-05-18
medline: 2022-05-20
Status: ppublish

Abstract

The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping. Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data. The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms.
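
As a rough illustration of the likelihood maximization described in the abstract, the following Python sketch runs a plain expectation-maximization loop for a one-dimensional spectrum. It is not the authors' implementation: the binning heuristic, the EM acceleration, multidimensional spectra, folding, bootstrapping, and threading are all omitted. The function name `em_sfs` and the input layout (one list per site holding the likelihood of the data for each possible derived-allele count) are assumptions made for this example.

```python
def em_sfs(site_likelihoods, n_iter=200, tol=1e-10):
    """Estimate a 1D site frequency spectrum by plain EM (illustrative only).

    site_likelihoods[s][j] is assumed to hold P(data at site s | j derived
    alleles in the sample), for j = 0 .. number of sampled chromosomes.
    Returns a list of proportions that sums to 1.
    """
    n_bins = len(site_likelihoods[0])
    sfs = [1.0 / n_bins] * n_bins  # flat starting point

    for _ in range(n_iter):
        expected = [0.0] * n_bins
        for lik in site_likelihoods:
            # E-step: posterior over allele counts at this site
            weighted = [p * l for p, l in zip(sfs, lik)]
            total = sum(weighted)
            if total == 0.0:
                continue  # site has zero likelihood under the current spectrum
            for j, w in enumerate(weighted):
                expected[j] += w / total

        norm = sum(expected)
        if norm == 0.0:
            raise ValueError("no site has non-zero likelihood")

        # M-step: new spectrum is the average posterior across sites
        new_sfs = [e / norm for e in expected]
        delta = max(abs(a - b) for a, b in zip(new_sfs, sfs))
        sfs = new_sfs
        if delta < tol:
            break
    return sfs


# Hypothetical toy input: 3 bins (one diploid individual), noisy likelihoods.
toy_sites = [
    [0.90, 0.09, 0.01],
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.02, 0.18, 0.80],
]
toy_sfs = em_sfs(toy_sites)
```

With fully informative (one-hot) likelihoods the loop converges to the observed proportions of each allele count, which is a convenient sanity check; with low-coverage data the likelihoods are diffuse and the estimate reflects that uncertainty rather than hard genotype calls.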

Identifiers

pubmed: 35579549
pii: 6586813
doi: 10.1093/gigascience/giac032
pmc: PMC9112775

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2022. Published by Oxford University Press GigaScience.

Authors

Alex Mas-Sandoval (A)

Department of Life Sciences, Silwood Park campus, Imperial College London, SL5 7PY, Ascot, UK.

Nathaniel S Pope (NS)

Department of Entomology, The Pennsylvania State University, 201 Old Main, University Park, PA 16802, USA.

Knud Nor Nielsen (KN)

Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, 1871 Frederiksberg C, Denmark.

Isin Altinkaya (I)

GLOBE, Section for Geogenetics, Øster Voldgade 5-7, 1350, Copenhagen, Denmark.

Matteo Fumagalli (M)

Department of Life Sciences, Silwood Park campus, Imperial College London, SL5 7PY, Ascot, UK.
School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK.

Thorfinn Sand Korneliussen (TS)

GLOBE, Section for Geogenetics, Øster Voldgade 5-7, 1350, Copenhagen, Denmark.
