A new efficient referential genome compression technique for FastQ files.
Compression
Decompression
FastQ
Identifiers
Quality scores
Journal
Functional & integrative genomics
ISSN: 1438-7948
Abbreviated title: Funct Integr Genomics
Country: Germany
NLM ID: 100939343
Publication information
Publication date:
11 Nov 2023
History:
received: 17 Jul 2023
accepted: 20 Oct 2023
revised: 13 Oct 2023
medline: 13 Nov 2023
pubmed: 11 Nov 2023
entrez: 10 Nov 2023
Status:
epublish
Abstract
Hospitals and medical laboratories generate a tremendous amount of genome sequence data every day for use in research, surgery, and disease diagnosis. Compression is therefore essential for storing, monitoring, and distributing these data, and a new compression technique is needed to reduce the time and cost of storage, transmission, and processing. General-purpose compression techniques perform poorly on these data because of their special features: a large number of repeats (tandem and palindromic), a small alphabet, high similarity between sequences, and specific file formats. In this study, we present RBFQC, a method for compressing FastQ files that uses a reference genome without sacrificing data quality. A FastQ file is first split into three streams (identifier, sequence, and quality score), each of which is compressed with its own technique. A novel, fast, and lightweight mapping mechanism is also presented to compress the sequence stream effectively. Experiments show that both the compression ratio and the compression/decompression time of NGS data compressed with RBFQC are superior to those of other state-of-the-art genome compression methods. Compared with GZIP, RBFQC may achieve a compression ratio of 80-140% for fixed-length datasets and 80-125% for variable-length datasets. Compared with domain-specific referential FastQ compression techniques, RBFQC improves total compression and decompression speed by 10-25%.
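The stream-splitting step described in the abstract can be illustrated with a minimal Python sketch. It is not the RBFQC implementation: the function names are hypothetical, zlib stands in for the per-stream compressors (which the abstract does not specify), and the reference-based mapping of the sequence stream is omitted. The sketch also assumes four-line records whose separator line is a bare "+".

```python
import zlib


def split_streams(fastq_text):
    """Split FASTQ text into identifier, sequence, and quality streams."""
    lines = fastq_text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        ids.append(header[1:])  # drop the leading "@"
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def compress_streams(ids, seqs, quals):
    """Compress each stream independently.

    zlib is only a placeholder here; a real tool would use a
    stream-specific coder for each of the three streams.
    """
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"), 9)
        for stream in (ids, seqs, quals)
    )


def decompress_streams(blobs):
    """Rebuild FASTQ text from the three compressed streams.

    The separator line is restored as a bare "+", so the round trip is
    exact only for inputs that use a bare "+" separator.
    """
    ids, seqs, quals = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(
        f"@{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals)
    )
```

Separating the streams groups similar symbols together (read names, bases, quality characters), which is what lets each stream receive a coder suited to its statistics.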
Identifiers
pubmed: 37950100
doi: 10.1007/s10142-023-01259-x
pii: 10.1007/s10142-023-01259-x
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
333
Copyright information
© 2023. The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
References
Bhukya R et al (2020) Compression for DNA sequences using Huffman encoding. In: Information and Communication Technology for Sustainable Development. Springer, Singapore, pp 615–624
doi: 10.1007/978-981-13-7166-0_61
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLoS One 8(3):e59190
doi: 10.1371/journal.pone.0059190
Chandak S et al (2018) SPRING: a next-generation compressor for FASTQ data. Bioinformatics
Deorowicz S, Grabowski S (2011) Compression of DNA sequence reads in FASTQ format. Bioinformatics 27(6):860–862
doi: 10.1093/bioinformatics/btr014
Dutta A, Haque MM, Bose T, Reddy CV, Mande SS (2015) FQC: a novel approach for efficient compression, archival, and dissemination of FastQ datasets. J Bioinform Comput Biol 13(3):1541003
doi: 10.1142/S0219720015410036
Genome is digital, and can be compressed (2022). Available at: https://blog.chiariglione.org/genome-is-digital-and-can-be-compressed/. Accessed 21 May 2022
Guerra A et al (2019) Tackling the challenges of FASTQ referential compression. Bioinform Biol Insights 13:1177932218821373
doi: 10.1177/1177932218821373
Huang ZA, Wen Z, Deng Q, Chu Y, Sun Y, Zhu Z (2017) LW-FQZip 2: a parallelized reference-base compression of FASTQ files. BMC Bioinform 18(1):179
doi: 10.1186/s12859-017-1588-x
Jian DD et al (2020) Genome compression and decompression. U.S. Patent No. 10,679,727
Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res 40(22):e171
doi: 10.1093/nar/gks754
Kowalski TM, Grabowski S (2020) PgRC: pseudogenome-based read compressor. Bioinformatics 36(7):2082–2089
doi: 10.1093/bioinformatics/btz919
Kredens KV et al (2020) Vertical lossless genomic data compression tools for assembled genomes: a systematic literature review. PLoS One 15(5):e0232942
doi: 10.1371/journal.pone.0232942
Kryukov K et al (2020) Sequence Compression Benchmark (SCB) database—a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences. GigaScience 9(7):giaa072
doi: 10.1093/gigascience/giaa072
https://www.ncbi.nlm.nih.gov/sra. Accessed Jun 2022
Kumar S, Agarwal S (2018) WBFQC: a new approach for compressing next-generation sequencing data splitting into homogeneous streams. J Bioinform Comput Biol 1850018
Kumar S, Agarwal S, Prasad R (2015) Efficient read alignment using Burrows-Wheeler transform and wavelet tree. In: 2015 Second International Conference on Advances in Computing and Communication Engineering (ICACCE). IEEE, pp 133–138
Lee SJ, Cho GY, Ikeno F, Lee TR (2018) BAQALC: blockchain applied lossless efficient transmission of DNA sequencing data for next generation medical informatics. Appl Sci 8(9):1471
doi: 10.3390/app8091471
Liu Y, Peng H, Wong L, Li J (2017) High-speed and high-ratio referential genome compression. Bioinformatics 33(21):3364–3372
doi: 10.1093/bioinformatics/btx412
Mansouri D, Yuan X, Saidani A (2020) A new lossless DNA compression algorithm based on a single-block encoding scheme. Algorithms 13(4):99
doi: 10.3390/a13040099
Nicolae M, Pathak S, Rajasekaran S (2015) LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31(20):3276–3281
doi: 10.1093/bioinformatics/btv384
Rabbani L, Müller J, Weigel D (2020) An algorithm to build a multi-genome reference. bioRxiv
doi: 10.1101/2020.04.11.036871
Roguski Ł, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics 30(15):2213–2215
doi: 10.1093/bioinformatics/btu208
Shokrof M, Abouelhoda M (2020) IonCRAM: a reference-based compression tool for ion torrent sequence files. BMC Bioinform 21(1):1–16
Sultan AY, Huang C-H (2019) LFastqC: a lossless non-reference-based FASTQ compressor. PLoS One 14(11)
Tembe W, Lowey J, Suh E (2010) G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26(17):2192–2194
doi: 10.1093/bioinformatics/btq346
Wan R, Anh VN, Asai K (2011) Transformations for the compression of FASTQ quality scores of next generation sequencing data. Bioinformatics 28(5):628–635
doi: 10.1093/bioinformatics/btr689
Wandelt S, Bux M, Leser U (2014) Trends in genome compression. Curr Bioinform 9:3
doi: 10.2174/1574893609666140516010143
Yu R, Yang W (2020) ScaleQC: a scalable lossy to lossless solution for NGS data compression. Bioinformatics
Zhang Y, Li L, Yang Y, Yang X, He S, Zhu Z (2015) Light-weight reference-based compression of FASTQ data. BMC Bioinform 16(1):188
doi: 10.1186/s12859-015-0628-7