Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash.


Journal

Genome Research
ISSN: 1549-5469
Abbreviated title: Genome Res
Country: United States
NLM ID: 9518021

Publication information

Publication date:
07 2023
History:
received: 04 01 2023
accepted: 06 06 2023
medline: 28 8 2023
pubmed: 22 6 2023
entrez: 21 6 2023
Status: ppublish

Abstract

Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail, so that users of FracMinHash can be warned about the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely than traditional MinHash, and that the point estimates and confidence intervals perform significantly better in estimating mutation distances.
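
The abstract describes the estimators only in words. As a rough illustration, the following Python sketch shows one way a FracMinHash containment estimate and a mutation-rate point estimate might be computed. It is not the authors' implementation or the sourmash API: the function names, the BLAKE2b hash, the bias-correction factor 1 - (1 - s)^|A|, and the relation C = (1 - p)^k under a simple independent point-mutation model are all assumptions made for this example.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the hash space (assumption: 64-bit hashes)


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to an integer in [0, MAX_HASH) using BLAKE2b."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(kmer_set, scale):
    """Keep every hash that falls below scale * MAX_HASH."""
    threshold = int(MAX_HASH * scale)
    return {h for h in map(hash_kmer, kmer_set) if h < threshold}


def containment_estimate(kmers_a, kmers_b, scale):
    """Estimate the containment of A in B from FracMinHash sketches.

    The raw ratio is divided by 1 - (1 - scale)^|A|, the assumed
    bias-correction factor (probability that the sketch of A is non-empty).
    """
    sketch_a = frac_sketch(kmers_a, scale)
    sketch_b = frac_sketch(kmers_b, scale)
    if not sketch_a:
        raise ValueError("sketch of A is empty; increase the scale factor")
    raw = len(sketch_a & sketch_b) / len(sketch_a)
    correction = 1.0 - (1.0 - scale) ** len(kmers_a)
    return raw / correction


def mutation_rate_point_estimate(containment, k):
    """Invert C = (1 - p)^k (assumed simple mutation model) to get p."""
    c = min(max(containment, 0.0), 1.0)  # clamp to a valid probability
    return 1.0 - c ** (1.0 / k)
```

With a scale factor of 1.0 the sketch retains every k-mer, so the estimate reduces to the exact containment; smaller scale factors trade accuracy for smaller sketches. Confidence intervals, which the paper also derives, are not covered by this sketch.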

Identifiers

pubmed: 37344105
pii: gr.277651.123
doi: 10.1101/gr.277651.123
pmc: PMC10538494

Publication types

Journal Article; Research Support, U.S. Gov't, Non-P.H.S.; Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

1061-1068

Grants

Agency: NIGMS NIH HHS
ID: R01 GM146462
Country: United States

Copyright information

© 2023 Rahman Hera et al.; Published by Cold Spring Harbor Laboratory Press.

Authors

Mahmudur Rahman Hera (M)

Department of Computer Science and Engineering, The Pennsylvania State University, State College, Pennsylvania 16801, USA.

N Tessa Pierce-Ward (NT)

Department of Population Health and Reproduction, University of California, Davis, California 95616, USA.

David Koslicki (D)

Department of Computer Science and Engineering, The Pennsylvania State University, State College, Pennsylvania 16801, USA; dmk333@psu.edu.
Department of Biology, The Pennsylvania State University, State College, Pennsylvania 16801, USA.
Huck Institutes of the Life Sciences, The Pennsylvania State University, State College, Pennsylvania 16801, USA.