Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.
Journal
Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944
Publication information
Publication date: 2 September 2023
History:
received: 18 April 2023
revised: 19 July 2023
accepted: 18 August 2023
medline: 19 September 2023
pubmed: 21 August 2023
entrez: 21 August 2023
Status: ppublish
Abstract
The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by using a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. MashMap3 is available at https://github.com/marbl/MashMap.
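The idea summarized in the abstract can be illustrated with a naive sketch. This is not the MashMap3 implementation: the function names, the BLAKE2b hash, the choice to deduplicate k-mers within a window, and the parameter names `k` (k-mer length), `w` (k-mers per window), and `s` (k-mers sampled per window) are all assumptions made for illustration. The sketch selects, for every window, the `s` smallest k-mer hashes, so `s = 1` reduces to ordinary minimizer selection, and it pairs this with the standard bottom-s MinHash estimate of Jaccard similarity. A rolling implementation would update each window incrementally instead of re-sorting it.

```python
import hashlib


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def window_sketches(seq, k, w, s):
    """For each window of w consecutive k-mers, its s smallest distinct hashes.

    Naive version: every window is re-sorted from scratch.
    """
    hashes = [kmer_hash(x) for x in kmers(seq, k)]
    return [sorted(set(hashes[i:i + w]))[:s]
            for i in range(len(hashes) - w + 1)]


def minmers(seq, k, w, s):
    """Hashes that are among the s smallest in at least one window."""
    selected = set()
    for sketch in window_sketches(seq, k, w, s):
        selected.update(sketch)
    return selected


def minhash_jaccard(sketch_a, sketch_b, s):
    """Bottom-s MinHash estimate of Jaccard similarity from two sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    if not merged:
        return 1.0
    return sum(1 for h in merged if h in shared) / len(merged)
```

For example, `minmers(seq, k=5, w=8, s=1)` returns the minimizer hashes of `seq`, while a larger `s` returns a superset of them that carries enough information to form a per-window MinHash sketch.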
Identifiers
pubmed: 37603771
pii: 7246743
doi: 10.1093/bioinformatics/btad512
pmc: PMC10505501
Publication types
Journal Article
Research Support, N.I.H., Intramural
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Grants
Agency: NLM NIH HHS
ID: T15 LM007093
Country: United States
Agency: NIAID NIH HHS
ID: P01 AI152999
Country: United States
Agency: NIDA NIH HHS
ID: U01 DA047638
Country: United States
Agency: NIGMS NIH HHS
ID: R01 GM123489
Country: United States
Comments and corrections
Type: UpdateOf
Copyright information
© The Author(s) 2023. Published by Oxford University Press.