Personalized pangenome references.

Journal

Nature Methods

ISSN: 1548-7105

Abbreviated title: Nat Methods

Country: United States

NLM ID: 101215604

Publication information

Publication date:
11 Sep 2024

History:

received: 18 Dec 2023

accepted: 06 Aug 2024

medline: 12 Sep 2024

pubmed: 12 Sep 2024

entrez: 11 Sep 2024

Status: ahead of print

Abstract

Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. This reduces small variant genotyping errors by four times relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
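The idea of choosing local haplotypes from read k-mer counts can be illustrated with a toy sketch. This is not the vg implementation: the function names, the scoring rule (+1 for each haplotype-specific k-mer seen in the reads, -1 for each one that is missing), and the parameters k, n, and min_count are all assumptions made for illustration. It also ignores reverse complements, sequencing errors, and graph structure.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_read_kmers(reads, k):
    """Count k-mer occurrences across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_haplotypes(haplotypes, read_counts, k, n=2, min_count=1):
    """Rank candidate haplotypes of one local block by read k-mer support.

    Hypothetical scoring: only k-mers that are not shared by every
    haplotype are informative; each informative k-mer adds 1 if it is
    seen at least min_count times in the reads and subtracts 1 otherwise.
    """
    hap_kmers = {name: kmers(seq, k) for name, seq in haplotypes.items()}
    shared = set.intersection(*hap_kmers.values())
    scores = {}
    for name, ks in hap_kmers.items():
        informative = ks - shared
        present = sum(1 for km in informative
                      if read_counts.get(km, 0) >= min_count)
        scores[name] = present - (len(informative) - present)
    # Highest score first; ties are broken by name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n], scores
```

In this sketch, a haplotype that matches the sample collects positive evidence from its own k-mers, while haplotypes carrying variants absent from the reads are penalized and fall out of the selected set.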

Identifiers

DOI: 10.1038/s41592-024-02407-2 PMID: 39261641

pii: 10.1038/s41592-024-02407-2

Publication types

Journal Article

Languages

eng

Grants

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)

ID: R01HG010485

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)

ID: U24HG010262

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)

ID: U24HG011853

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)

ID: U01HG010961

Agency: U.S. Department of Health & Human Services | National Institutes of Health (NIH)

ID: OT2OD033761

Agency: U.S. Department of Health & Human Services | NIH | National Heart, Lung, and Blood Institute (NHLBI)

ID: OT3HL142481

References

Eizenga, J. M. et al. Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020).

doi: 10.1146/annurev-genom-120219-080406

Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).

doi: 10.1038/nbt.4227 pubmed: 30125266 pmcid: 6126949

Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).

doi: 10.1186/s13059-020-02157-2 pubmed: 32972461 pmcid: 7513500

Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).

doi: 10.1126/science.abg8871 pubmed: 34914532 pmcid: 9365333

The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

doi: 10.1038/nature15393

Pritt, J., Chen, N.-C. & Langmead, B. FORGe: prioritizing variants for graph genomes. Genome Biol. 19, 220 (2018).

doi: 10.1186/s13059-018-1595-x pubmed: 30558649 pmcid: 6296055

Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

doi: 10.1038/s41586-023-05896-x pubmed: 37165242 pmcid: 10172123

Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R. & McVean, G. Improved genome inference in the MHC using a population reference graph. Nat. Genet. 47, 682–688 (2015).

doi: 10.1038/ng.3257 pubmed: 25915597 pmcid: 4449272

Vaddadi, K., Mun, T. & Langmead, B. Minimizing reference bias with an impute-first approach. Preprint at bioRxiv https://doi.org/10.1101/2023.11.30.568362 (2023).

Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01793-w (2023).

Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

doi: 10.1038/nbt.4235 pubmed: 30247488

Hickey, G. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020).

doi: 10.1186/s13059-020-1941-7 pubmed: 32051000 pmcid: 7017486

Ebler, J. et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54, 518–525 (2022).

doi: 10.1038/s41588-022-01043-w pubmed: 35410384 pmcid: 9005351

Human Pangenome Reference Consortium. HPRC Pangenome Resources. GitHub https://github.com/human-pangenomics/hpp_pangenome_resources/ (2023).

Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).

doi: 10.1093/bioinformatics/btx304 pubmed: 28472236

Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).

doi: 10.1016/j.xgen.2022.100128 pubmed: 36452119 pmcid: 9706577

Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).

Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).

doi: 10.1038/s41587-019-0054-x pubmed: 30858580 pmcid: 6699627

Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).

Carroll, A. et al. Accurate human genome analysis with Element Avidity sequencing. Preprint at bioRxiv https://doi.org/10.1101/2023.08.11.553043 (2023).

Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).

doi: 10.1038/s41587-021-01158-1 pubmed: 35132260 pmcid: 9117392

Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).

doi: 10.1038/s41592-018-0054-7 pubmed: 30013044 pmcid: 6341484

Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).

doi: 10.1038/s41587-020-0538-8 pubmed: 32541955 pmcid: 8454654

Kolmogorov, M. et al. Scalable nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).

doi: 10.1038/s41592-023-01993-x pubmed: 37710018 pmcid: 11222905

Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).

doi: 10.1101/gr.260604.119 pubmed: 33328168 pmcid: 7849385

Feuk, L., Carson, A. R. & Scherer, S. W. Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97 (2006).

doi: 10.1038/nrg1767 pubmed: 16418744

Rausch, T. et al. Delly: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).

doi: 10.1093/bioinformatics/bts378 pubmed: 22962449 pmcid: 3436805

Mohiyuddin, M. et al. MetaSV: an accurate and integrative structural-variant caller for next generation sequencing. Bioinformatics 31, 2741–2744 (2015).

doi: 10.1093/bioinformatics/btv204 pubmed: 25861968 pmcid: 4528635

Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).

doi: 10.1093/bioinformatics/btv710 pubmed: 26647377

Fang, H. et al. Indel variant analysis of short-read sequencing data with Scalpel. Nat. Protoc. 11, 2529–2548 (2016).

doi: 10.1038/nprot.2016.150 pubmed: 27854363 pmcid: 5507611

Wala, J. A. et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).

doi: 10.1101/gr.221028.117 pubmed: 29535149 pmcid: 5880247

Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

doi: 10.1038/s41587-019-0072-8 pubmed: 30936562

Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).

doi: 10.1186/s13059-020-02107-y pubmed: 32746918 pmcid: 7477834

Smolka, M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-02024-y (2024).

Sirén, J. & Paten, B. GBZ file format for pangenome graphs. Bioinformatics 38, 5012–5018 (2022).

doi: 10.1093/bioinformatics/btac656 pubmed: 36179091 pmcid: 9665857

Sirén, J., Garrison, E., Novak, A. M., Paten, B. & Durbin, R. Haplotype-aware graph indexes. Bioinformatics 36, 400–407 (2020).

doi: 10.1093/bioinformatics/btz575 pubmed: 31406990

Paten, B. et al. Superbubbles, ultrabubbles, and cacti. J. Comput. Biol. 25, 649–663 (2018).

doi: 10.1089/cmb.2017.0251 pubmed: 29461862 pmcid: 6067107

Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).

doi: 10.1101/gr.074492.107 pubmed: 18349386 pmcid: 2336801

Chang, X., Eizenga, J., Novak, A. M., Sirén, J. & Paten, B. Distance indexing and seed clustering in sequence graphs. Bioinformatics 36, i146–i153 (2020).

doi: 10.1093/bioinformatics/btaa446 pubmed: 32657356 pmcid: 7355256

Gagie, T., Navarro, G. & Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. Assoc. Comput. Mach. 67, 2 (2020).

doi: 10.1145/3375890

Dufresne, Y. et al. The k-mer file format: a standardized and compact disk representation of sets of k-mers. Bioinformatics 38, 4423–4425 (2022).

doi: 10.1093/bioinformatics/btac528 pubmed: 35904548 pmcid: 9477520

Authors

Jouni Sirén (J)

Parsa Eskandar (P)

Matteo Tommaso Ungaro (MT)

Glenn Hickey (G)

Jordan M Eizenga (JM)

Adam M Novak (AM)

Xian Chang (X)

Pi-Chuan Chang (PC)

Mikhail Kolmogorov (M)

Andrew Carroll (A)

Jean Monlong (J)

Benedict Paten (B)
