Personalized pangenome references.
Journal
Nature Methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604
Publication information
Publication date:
11 Sep 2024
History:
received: 18 Dec 2023
accepted: 06 Aug 2024
medline: 12 Sep 2024
pubmed: 12 Sep 2024
entrez: 11 Sep 2024
Status:
aheadofprint
Abstract
Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when a sample is compared to a pangenome, variants in the pangenome that are absent from the sample can be misleading and can, for example, cause false read mappings. These irrelevant variants generally have lower allele frequencies and have previously been handled by filtering out rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant ones. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. The approach reduces small variant genotyping errors fourfold relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
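The idea of sampling local haplotypes according to k-mer counts in the reads can be illustrated with a toy sketch. This is not the vg implementation; the function names, the +1/-1 scoring rule, and the tiny value of k are all assumptions chosen for illustration. Within one local block, each candidate haplotype is scored by whether its informative k-mers (those not shared by every candidate) occur in the reads, and the best-scoring haplotypes are kept.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_read_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def score_haplotype(hap, read_counts, k, shared):
    """Hypothetical scoring rule: +1 per informative k-mer seen in the
    reads, -1 per informative k-mer that is absent from the reads."""
    score = 0
    for km in kmers(hap, k) - shared:
        score += 1 if read_counts[km] > 0 else -1
    return score


def sample_local_haplotypes(haplotypes, reads, k=5, n=2):
    """Return the names of the n best-scoring haplotypes in one block."""
    read_counts = count_read_kmers(reads, k)
    # k-mers present in every candidate cannot distinguish between them.
    shared = set.intersection(*(kmers(h, k) for h in haplotypes.values()))
    ranked = sorted(
        haplotypes,
        key=lambda name: score_haplotype(haplotypes[name], read_counts, k, shared),
        reverse=True,
    )
    return ranked[:n]


# Toy block with three candidate haplotypes that differ near the middle.
haps = {
    "hapA": "ACGTACGTTAGC",
    "hapB": "ACGTACCTTAGC",
    "hapC": "ACGTAGGTTAGC",
}
# Error-free toy reads drawn from hapA.
reads = ["ACGTACGT", "TACGTTAGC"]
print(sample_local_haplotypes(haps, reads, k=5, n=1))  # prints ['hapA']
```

A real sampler would need to account for sequencing errors, read coverage, and diploid samples, none of which this sketch models.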
Identifiers
pubmed: 39261641
doi: 10.1038/s41592-024-02407-2
pii: 10.1038/s41592-024-02407-2
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Grants
Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: R01HG010485
Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: U24HG010262
Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: U24HG011853
Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: U01HG010961
Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)
ID: OT2OD033761
Agency: U.S. Department of Health & Human Services | NIH | National Heart, Lung, and Blood Institute (NHLBI)
ID: OT3HL142481
Copyright information
© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.