GBZ file format for pangenome graphs.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
15 11 2022
History:
received: 12 07 2022
revised: 06 09 2022
accepted: 30 09 2022
pubmed: 1 10 2022
medline: 19 11 2022
entrez: 30 9 2022
Status: ppublish

Abstract

Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space efficiently. We propose the GBZ file format based on data structures used in the Giraffe short-read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems. C++ and Rust implementations are available at https://github.com/jltsiren/gbwtgraph and https://github.com/jltsiren/gbwt-rs, respectively. Supplementary data are available at Bioinformatics online.

Identifiers

pubmed: 36179091
pii: 6731924
doi: 10.1093/bioinformatics/btac656
pmc: PMC9665857

Publication types

Journal Article; Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

5012-5018

Grants

Agency: NHGRI NIH HHS
ID: R01 HG010485
Country: United States
Agency: NIH HHS
ID: U01HG010961
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG010971
Country: United States
Agency: NIH HHS
ID: OT2 OD026682
Country: United States
Agency: NHGRI NIH HHS
ID: U41 HG010972
Country: United States

Copyright information

© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Authors

Jouni Sirén (J)

Genomics Institute, University of California, Santa Cruz, CA 95064, USA.

Benedict Paten (B)

Genomics Institute, University of California, Santa Cruz, CA 95064, USA.
