Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data.

Gene Frequency; Genetics, Population; Genotype; High-Throughput Nucleotide Sequencing / methods; Humans; Likelihood Functions; Polymorphism, Single Nucleotide; Sequence Analysis, DNA / methods

genotype likelihoods; high-throughput sequencing; maximum likelihood; next-generation sequencing; population genetics; site frequency spectrum; threading

Journal

GigaScience

ISSN: 2047-217X

Abbreviated title: Gigascience

Country: United States

NLM ID: 101596872

Publication information

Publication date:
2022-05-17

History:

received: 2021-10-21

revised: 2021-12-16

entrez: 2022-05-17

pubmed: 2022-05-18

medline: 2022-05-20

Status: ppublish

Abstract

The site frequency spectrum summarizes the distribution of allele frequencies throughout the genome, and it is widely used as a summary statistic to infer demographic parameters and to detect signals of natural selection. The use of high-throughput low-coverage DNA sequencing data can lead to biased estimates of the site frequency spectrum due to high levels of uncertainty in genotyping. Here we design and implement a method to efficiently and accurately estimate the multidimensional joint site frequency spectrum for large numbers of haploid or diploid individuals across an arbitrary number of populations, using low-coverage sequencing data. The method maximizes a likelihood function that represents the probability of the sequencing data observed given a multidimensional site frequency spectrum using genotype likelihoods. Notably, it uses an advanced binning heuristic paired with an accelerated expectation-maximization algorithm for a fast and memory-efficient computation, and can generate both unfolded and folded spectra and bootstrapped replicates for haploid and diploid genomes. On the basis of extensive simulations, we show that the new method requires remarkably less storage and is faster than previous implementations whilst retaining the same accuracy. When applied to low-coverage sequencing data from the fungal pathogen Neonectria neomacrospora, results recapitulate the patterns of population differentiation generated using the original high-coverage data. The new implementation allows for accurate estimation of population genetic parameters from arbitrarily large, low-coverage datasets, thus facilitating cost-effective sequencing experiments in model and non-model organisms.
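The likelihood maximization described in the abstract can be illustrated with a minimal sketch. It assumes a single population and plain, unaccelerated expectation-maximization; the binning heuristic, the acceleration scheme, and the multidimensional case from the paper are not reproduced here. The function name `em_sfs` and the input layout (one row of allele-count likelihoods per site) are hypothetical choices made for illustration, not the authors' implementation.

```python
def em_sfs(site_likelihoods, max_iter=500, tol=1e-12):
    """Plain EM estimate of a one-dimensional site frequency spectrum.

    Assumption: site_likelihoods[s][j] holds P(reads at site s | j derived
    alleles in the sample), for j = 0..n, where n is the number of sampled
    chromosomes. Returns n + 1 proportions that sum to 1.
    """
    n_sites = len(site_likelihoods)
    n_bins = len(site_likelihoods[0])
    sfs = [1.0 / n_bins] * n_bins  # uniform starting point

    for _ in range(max_iter):
        expected = [0.0] * n_bins
        for lik in site_likelihoods:
            # E-step: posterior probability of each allele count at this site.
            joint = [p * l for p, l in zip(sfs, lik)]
            total = sum(joint)
            if total == 0.0:
                raise ValueError("site has zero likelihood under the current spectrum")
            for j, value in enumerate(joint):
                expected[j] += value / total
        # M-step: the new spectrum is the average posterior across sites.
        updated = [e / n_sites for e in expected]
        change = max(abs(a - b) for a, b in zip(updated, sfs))
        sfs = updated
        if change < tol:
            break
    return sfs
```

Calling `em_sfs` on a list with one row per site returns the estimated proportion of sites in each derived-allele-count class. When the likelihoods are uninformative (every row constant), the estimate stays at the uniform starting point, which shows that the result depends entirely on the information carried by the genotype likelihoods.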

Identifiers

DOI: 10.1093/gigascience/giac032

PMID: 35579549

PMC: PMC9112775

pii: 6586813

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Authors

Alex Mas-Sandoval (A)

Nathaniel S Pope (NS)

Knud Nor Nielsen (KN)

Isin Altinkaya (I)

Matteo Fumagalli (M)

Thorfinn Sand Korneliussen (TS)
