Disk compression of k-mer sets.

Compression De Bruijn graphs Spectrum-preserving string sets k-mer sets

Journal

Algorithms for molecular biology : AMB
ISSN: 1748-7188
Titre abrégé: Algorithms Mol Biol
Pays: England
ID NLM: 101265088

Informations de publication

Date de publication:
21 Jun 2021
Historique:
received: 05 02 2021
accepted: 08 06 2021
entrez: 22 6 2021
pubmed: 23 6 2021
medline: 23 6 2021
Statut: epublish

Résumé

K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

Identifiants

pubmed: 34154632
doi: 10.1186/s13015-021-00192-7
pii: 10.1186/s13015-021-00192-7
pmc: PMC8218509
doi:

Types de publication

Journal Article

Langues

eng

Pagination

10

Subventions

Organisme : National Science Foundation
ID : 1453527
Organisme : National Science Foundation
ID : 1439057
Organisme : National Science Foundation
ID : 1453527
Organisme : National Science Foundation
ID : 1439057
Organisme : NIH HHS
ID : CBIOS training grant
Pays : United States

Références

Bioinformatics. 2018 Feb 15;34(4):568-575
pubmed: 29444235
Genome Biol. 2021 Apr 6;22(1):96
pubmed: 33823902
Bioinformatics. 2020 Feb 1;36(3):721-727
pubmed: 31504157
Bioinformatics. 2017 Sep 1;33(17):2759-2761
pubmed: 28472236
Genome Biol. 2014 Mar 03;15(3):R46
pubmed: 24580807
Bioinformatics. 2013 Mar 1;29(5):652-3
pubmed: 23325618
iScience. 2019 Aug 30;18:28-36
pubmed: 31377530
Bioinformatics. 2020 Jul 1;36(Suppl_1):i177-i185
pubmed: 32657392
Bioinformatics. 2018 Sep 1;34(17):i766-i772
pubmed: 30423080
Nat Methods. 2016 Dec;13(12):1005-1008
pubmed: 27776113
Bioinformatics. 2019 Feb 1;35(3):415-420
pubmed: 30032192
J Comput Biol. 2012 May;19(5):455-77
pubmed: 22506599
Genome Res. 2021 Jan;31(1):1-12
pubmed: 33328168
Bioinformatics. 2019 Oct 1;35(19):3826-3828
pubmed: 30799504
iScience. 2019 Aug 30;18:20-27
pubmed: 31352182
Cell Syst. 2018 Aug 22;7(2):201-207.e4
pubmed: 29936185
Nat Biotechnol. 2016 Mar;34(3):300-2
pubmed: 26854477
J Comput Biol. 2018 Jul;25(7):755-765
pubmed: 29641248
Bioinformatics. 2014 Jan 1;30(1):117-8
pubmed: 24132931
Nat Biotechnol. 2019 Feb;37(2):152-159
pubmed: 30718882
Bioinformatics. 2011 Mar 15;27(6):764-70
pubmed: 21217122
Bioinformatics. 2018 Aug 1;34(15):2556-2565
pubmed: 29554215
Genome Res. 2012 Mar;22(3):557-67
pubmed: 22147368
Bioinformatics. 2016 Jun 15;32(12):i201-i208
pubmed: 27307618
Genome Biol. 2016 Jun 20;17(1):132
pubmed: 27323842

Auteurs

Amatur Rahman (A)

Penn State University, State College, PA, USA. aur1111@psu.edu.

Rayan Chikhi (R)

Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France.

Paul Medvedev (P)

Penn State University, State College, PA, USA.

Classifications MeSH