Genie: the first open-source ISO/IEC encoder for genomic data.


Journal

Communications biology
ISSN: 2399-3642
Abbreviated title: Commun Biol
Country: England
NLM ID: 101719179

Publication information

Publication date:
09 May 2024
History:
received: 03 04 2023
accepted: 26 04 2024
medline: 10 5 2024
pubmed: 10 5 2024
entrez: 9 5 2024
Status: epublish

Abstract

For the last two decades, the amount of genomic data produced by scientific and medical applications has been growing at a rapid pace. To enable software solutions that analyze, process, and transmit these data in an efficient and interoperable way, ISO and IEC released the first version of the compression standard MPEG-G in 2019. However, non-proprietary implementations of the standard are not openly available so far, limiting fair scientific assessment of the standard and, therefore, hindering its broad adoption. In this paper, we present Genie, to the best of our knowledge the first open-source encoder that compresses genomic data according to the MPEG-G standard. We demonstrate that Genie reaches state-of-the-art compression ratios while offering interoperability with any other standard-compliant decoder independent from its manufacturer. Finally, the ISO/IEC ecosystem ensures the long-term sustainability and decodability of the compressed data through the ISO/IEC-supported reference decoder.

Identifiers

pubmed: 38724695
doi: 10.1038/s42003-024-06249-8
pii: 10.1038/s42003-024-06249-8

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

553

Grants

Agency: Bundesministerium für Bildung und Forschung (Federal Ministry of Education and Research)
ID: 01EK2204F

Copyright information

© 2024. The Author(s).

Authors

Fabian Müntefering (F)

Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany. muenteferi@tnt.uni-hannover.de.

Yeremia Gunawan Adhisantoso (YG)

Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany.

Shubham Chandak (S)

Department of Electrical Engineering, Stanford University, 350 Jane Stanford Way, Stanford, CA, 94305, USA.

Jörn Ostermann (J)

Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany.

Mikel Hernaez (M)

Center for Applied Medical Research (CIMA), University of Navarra, Av. de Pío XII, 55, Pamplona, 31008, Navarra, Spain. mhernaez@unav.es.

Jan Voges (J)

Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany. voges@tnt.uni-hannover.de.
