Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
02 09 2023
History:
received: 18 04 2023
revised: 19 07 2023
accepted: 18 08 2023
medline: 19 9 2023
pubmed: 21 8 2023
entrez: 21 8 2023
Status: ppublish

Abstract

The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by using a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. MashMap3 is available at https://github.com/marbl/MashMap.
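
The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not MashMap's implementation: the function names, the BLAKE2b-based hash, and the parameters `k` (k-mer length), `w` (window length in k-mers), and `s` (k-mers sampled per window) are all assumptions chosen for illustration. The sketch computes the exact Jaccard similarity on k-mer sets, a bottom-s MinHash estimate of it, and a simplified minmer selection in which a k-mer is kept if it is among the `s` smallest hashes in at least one window. With `s = 1` the selection reduces to ordinary minimizers.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmers(seq: str, k: int) -> list:
    """All overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def exact_jaccard(a: str, b: str, k: int) -> float:
    """Jaccard similarity of the two k-mer sets: |A & B| / |A | B|."""
    set_a, set_b = set(kmers(a, k)), set(kmers(b, k))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def bottom_s(hashes, s: int) -> set:
    """The s smallest distinct hash values."""
    return set(sorted(set(hashes))[:s])


def minhash_jaccard(a: str, b: str, k: int, s: int) -> float:
    """Bottom-s MinHash estimate of the Jaccard similarity."""
    sketch_a = bottom_s((kmer_hash(x) for x in kmers(a, k)), s)
    sketch_b = bottom_s((kmer_hash(x) for x in kmers(b, k)), s)
    # The s smallest hashes of the union act as a random sample of A | B;
    # the fraction of them present in both sketches estimates the Jaccard.
    merged = bottom_s(sketch_a | sketch_b, s)
    if not merged:
        return 0.0
    return len(merged & sketch_a & sketch_b) / len(merged)


def minmers(seq: str, k: int, w: int, s: int) -> set:
    """Simplified minmer selection (assumption: ties broken by position).

    Returns (hash, position) pairs that are among the s smallest in at
    least one window of w consecutive k-mers.
    """
    hashed = [(kmer_hash(x), i) for i, x in enumerate(kmers(seq, k))]
    selected = set()
    for start in range(max(len(hashed) - w + 1, 0)):
        window = hashed[start:start + w]
        selected.update(sorted(window)[:s])
    return selected
```

Under these assumptions, calling `minmers(seq, k, w, 1)` yields the minimizers of `seq`, while larger values of `s` keep more k-mers per window, so each window retains its own small MinHash sketch. When `s` is at least the number of distinct k-mers in both sequences, `minhash_jaccard` coincides with `exact_jaccard`.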

Identifiers

pubmed: 37603771
pii: 7246743
doi: 10.1093/bioinformatics/btad512
pmc: PMC10505501

Publication types

Journal Article
Research Support, N.I.H., Intramural
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Grants

Agency: NLM NIH HHS
ID: T15 LM007093
Country: United States
Agency: NIAID NIH HHS
ID: P01 AI152999
Country: United States
Agency: NIDA NIH HHS
ID: U01 DA047638
Country: United States
Agency: NIGMS NIH HHS
ID: R01 GM123489
Country: United States

Comments and corrections

Type: UpdateOf

Copyright information

© The Author(s) 2023. Published by Oxford University Press.

Authors

Bryce Kille (B)

Department of Computer Science, Rice University, Houston, TX, United States.

Erik Garrison (E)

Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States.

Todd J Treangen (TJ)

Department of Computer Science, Rice University, Houston, TX, United States.

Adam M Phillippy (AM)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, United States.
