Weighted minimizer sampling improves long read mapping.

Algorithms; Data Compression; Genomics; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA; Software

Journal

Bioinformatics (Oxford, England)

ISSN: 1367-4811

Abbreviated title: Bioinformatics

Country: England

NLM ID: 9808944

Publication information

Publication date:
1 July 2020

History:

entrez: 14 July 2020

pubmed: 14 July 2020

medline: 9 March 2021

Status: ppublish

Abstract

In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions. We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.
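The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the Winnowmap implementation: the function names, the hash, and the key transform `u ** (1 / weight)` are assumptions chosen for illustration. Each k-mer gets a pseudo-random value u in (0, 1], the key u^(1/weight) is computed, and the k-mer with the largest key is sampled from every window of w consecutive k-mers. With equal weights this behaves like ordinary minimizer sampling under a random k-mer ordering; a weight below 1 lowers a k-mer's key and makes it less likely to be chosen, while every window still contributes one sample.

```python
import hashlib


def kmer_hash01(kmer: str) -> float:
    """Map a k-mer to a deterministic pseudo-random value in (0, 1]."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)


def weighted_minimizers(seq, k, w, weights=None, default_weight=1.0):
    """Sample one k-mer per window of w consecutive k-mers.

    weights maps a k-mer to a positive weight (hypothetical parameter);
    k-mers absent from the mapping get default_weight. Returns a sorted
    list of (position, k-mer) pairs.
    """
    weights = weights or {}
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Assumed transform: a larger weight pushes the key toward 1.
    keys = [kmer_hash01(km) ** (1.0 / weights.get(km, default_weight))
            for km in kmers]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Largest key wins; ties are broken in favour of the leftmost position.
        best = max(window, key=lambda i: (keys[i], -i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]


if __name__ == "__main__":
    seq = "ACGTTGCAAAAAAAAAAAAAAAAGGCTTACGATCG"
    print(weighted_minimizers(seq, k=4, w=5))
    # Down-weight the repetitive k-mer "AAAA" (the weight value is arbitrary).
    print(weighted_minimizers(seq, k=4, w=5, weights={"AAAA": 0.01}))
```

Because a sample is taken from every window and nothing is discarded afterwards, two sequences that share a substring of at least w + k - 1 bases still share a sampled k-mer, provided both are sketched with the same weights.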

Identifiers

DOI: 10.1093/bioinformatics/btaa435 PMID: 32657365 PMC: PMC7355284

pubmed: 32657365

pii: 5870473

doi: 10.1093/bioinformatics/btaa435

pmc: PMC7355284

Publication types

Journal Article; Research Support, N.I.H., Intramural

Languages

eng

Pagination

i111-i118

Copyright information

Published by Oxford University Press 2020.

Authors

Chirag Jain (C)

Arang Rhie (A)

Haowen Zhang (H)

Claudia Chu (C)

Brian P Walenz (BP)

Sergey Koren (S)

Adam M Phillippy (AM)
