Weighted minimizer sampling improves long read mapping.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
1 July 2020
History:
entrez: 14 July 2020
pubmed: 14 July 2020
medline: 9 March 2021
Status: ppublish

Abstract

In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they are guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.

We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long-read mapper, Minimap2. Our results demonstrate a reduction in the mapping error rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.

Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.
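The idea of sampling minimizers under per-k-mer weights can be illustrated with a small sketch. It is not taken from the Winnowmap codebase. It assumes one common way of making selection probability grow with weight: each k-mer gets a uniform hash u, and k-mers are ordered by -log(u)/weight, so a larger weight tends to give a smaller order value. The function names, the 0.125 weight for frequent k-mers, and the `max_count` threshold are all hypothetical choices made for illustration.

```python
import hashlib
import math
from collections import Counter


def uniform_hash(kmer: str) -> float:
    """Map a k-mer to a deterministic pseudo-random value in (0, 1]."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)


def weighted_order(kmer: str, weight: float) -> float:
    """Order value (smaller wins). A larger weight shrinks the value,
    so a heavily weighted k-mer is more likely to be the window minimum.
    Assumption: this exponential-race transform is illustrative only."""
    return -math.log(uniform_hash(kmer)) / weight


def weighted_minimizers(seq, k, w, weights, default_weight=1.0):
    """Pick one minimizer from every window of w consecutive k-mers.
    Returns a sorted list of (position, k-mer) pairs without duplicates."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    order = [weighted_order(x, weights.get(x, default_weight)) for x in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (order[i], i))  # leftmost on ties
        picked.add((best, kmers[best]))
    return sorted(picked)


def downweight_frequent(seq, k, max_count, low_weight=0.125):
    """Hypothetical helper: give a low weight to k-mers seen more than
    max_count times; all other k-mers keep the default weight."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {x: low_weight for x, c in counts.items() if c > max_count}


if __name__ == "__main__":
    # Toy sequence: a unique stretch flanked by a tandem repeat.
    seq = "ACGTTGCA" * 6 + "GATTACAGATCCGTAGGCTA" + "ACGTTGCA" * 6
    weights = downweight_frequent(seq, k=5, max_count=3)
    plain = weighted_minimizers(seq, 5, 4, {})
    weighted = weighted_minimizers(seq, 5, 4, weights)
    print(len(plain), "minimizers without weights")
    print(len(weighted), "minimizers with frequent k-mers down-weighted")
```

Because the order value depends only on the k-mer and its weight, two sequences that share a substring covering a full window still select the same k-mer from it, provided both use the same weight table. No k-mer is discarded outright, so every window keeps a minimizer; frequent k-mers are simply less likely to win.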

Identifiers

pubmed: 32657365
pii: 5870473
doi: 10.1093/bioinformatics/btaa435
pmc: PMC7355284

Publication types

Journal Article
Research Support, N.I.H., Intramural

Languages

eng

Citation subsets

IM

Pagination

i111-i118

Copyright information

Published by Oxford University Press 2020.

Authors

Chirag Jain (C)

National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.

Arang Rhie (A)

National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.

Haowen Zhang (H)

College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Claudia Chu (C)

College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Brian P Walenz (BP)

National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.

Sergey Koren (S)

National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.

Adam M Phillippy (AM)

National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
