Improved sequence mapping using a complete reference genome and lift-over.


Journal

Nature methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604

Publication information

Publication date:
30 Nov 2023
History:
received: 27 04 2022
accepted: 09 10 2023
medline: 1 12 2023
pubmed: 1 12 2023
entrez: 30 11 2023
Status: aheadofprint

Abstract

Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, GRCh38 from the Genome Reference Consortium (GRC)) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are lower quality.
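
To make the idea of lift-over concrete, here is a minimal sketch of translating a single coordinate from one assembly to another through a list of ungapped aligned blocks, similar in spirit to a chain-style whole-genome map. It is an illustration only and not the levioSAM2 implementation: the `Block` type, the `lift_position` function, and the example coordinates are all hypothetical, and the sketch assumes a single contig, same-strand blocks, and 0-based positions.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One ungapped aligned segment between two assemblies (hypothetical type)."""
    src_start: int  # 0-based start on the source assembly
    dst_start: int  # 0-based start on the target assembly
    length: int     # number of aligned bases in the block


def lift_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Map a source position to the target assembly.

    `blocks` must be sorted by src_start and must not overlap.
    Returns None when the position falls in an unaligned gap.
    """
    starts = [b.src_start for b in blocks]
    # Find the last block that starts at or before pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.src_start
    if offset >= block.length:
        return None  # pos lies after the block ends, i.e. in a gap
    return block.dst_start + offset


# Toy map: two aligned blocks separated by a 10-base gap on the source.
toy_map = [Block(0, 100, 50), Block(60, 150, 40)]
lifted = [lift_position(toy_map, p) for p in (10, 55, 60, 99, 100)]
```

In this toy map, positions inside a block are shifted by the block's offset, while positions in the gap between blocks, or past the last block, cannot be lifted and return `None`. Lifting read alignments rather than single positions involves further steps that this sketch does not attempt.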

Identifiers

pubmed: 38036856
doi: 10.1038/s41592-023-02069-6
pii: 10.1038/s41592-023-02069-6

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: NHGRI NIH HHS
ID: R01 HG011392
Country: United States
Agency: NIGMS NIH HHS
ID: R35 GM139602
Country: United States

Copyright information

© 2023. The Author(s), under exclusive licence to Springer Nature America, Inc.

Authors

Nae-Chyun Chen (NC)

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. cnaechy1@jhu.edu.

Luis F Paulin (LF)

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.

Fritz J Sedlazeck (FJ)

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
Department of Computer Science, Rice University, Houston, TX, USA.

Sergey Koren (S)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.

Adam M Phillippy (AM)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.

Ben Langmead (B)

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. langmea@cs.jhu.edu.

MeSH classifications