Improved sequence mapping using a complete reference genome and lift-over.
Journal
Nature Methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604
Publication information
Publication date:
30 Nov 2023
History:
received: 27 Apr 2022
accepted: 9 Oct 2023
medline: 1 Dec 2023
pubmed: 1 Dec 2023
entrez: 30 Nov 2023
Status:
aheadofprint
Abstract
Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map. In addition to enabling the use of several references, we demonstrate that aligning reads to a high-quality reference (for example, T2T-CHM13) and lifting to an older reference (for example, Genome Reference Consortium (GRC)h38) improves the accuracy of the resulting variant calls on the old reference. By leveraging the quality improvements of T2T-CHM13, levioSAM2 reduces small and structural variant calling errors compared with GRC-based mapping using real short- and long-read datasets. Performance is especially improved for a set of complex medically relevant genes, where the GRC references are lower quality.
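As a rough illustration of what coordinate lift-over with a whole-genome map involves, the sketch below translates a position from one assembly to another using a sorted list of aligned blocks. This is not levioSAM2's implementation: the `Block` type, the `lift_position` function, and the toy coordinates are all hypothetical, and real lift-over maps (for example, chain files) also carry strand, contig names, and gap information that this sketch omits.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One gap-free aligned block between a source and a target assembly (hypothetical type)."""
    src_start: int  # 0-based, inclusive start on the source assembly
    src_end: int    # 0-based, exclusive end on the source assembly
    dst_start: int  # 0-based start of the matching block on the target assembly


def lift_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Translate a source position to the target assembly.

    Assumes `blocks` is sorted by src_start and non-overlapping.
    Returns None when the position falls in an unaligned gap.
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    if pos >= block.src_end:
        return None  # pos lies between two aligned blocks
    return block.dst_start + (pos - block.src_start)


# Toy map: the target has a 5-base insertion after source position 100,
# and source positions 200-249 have no counterpart on the target.
TOY_BLOCKS = [
    Block(0, 100, 0),
    Block(100, 200, 105),
    Block(250, 300, 255),
]

if __name__ == "__main__":
    for p in (50, 150, 220, 299):
        print(p, "->", lift_position(TOY_BLOCKS, p))
```

Positions inside an aligned block shift by that block's offset, while positions in gaps cannot be lifted and are reported as `None`.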
Identifiers
pubmed: 38036856
doi: 10.1038/s41592-023-02069-6
pii: 10.1038/s41592-023-02069-6
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Grants
Agency: NHGRI NIH HHS
ID: R01 HG011392
Country: United States
Agency: NIGMS NIH HHS
ID: R35 GM139602
Country: United States
Copyright information
© 2023. The Author(s), under exclusive licence to Springer Nature America, Inc.
References
Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
doi: 10.1101/gr.213611.116
pubmed: 28396521
pmcid: 5411779
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
doi: 10.1126/science.abj6987
pubmed: 35357919
pmcid: 9186530
Guo, Y. et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics 109, 83–90 (2017).
doi: 10.1016/j.ygeno.2017.01.005
pubmed: 28131802
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).
doi: 10.1126/science.abl3533
pubmed: 35357935
pmcid: 9336181
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
doi: 10.1038/s41586-020-2308-7
pubmed: 32461654
pmcid: 7334197
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
doi: 10.1371/journal.pmed.1001779
pubmed: 25826379
pmcid: 4380465
Smigielski, E. M., Sirotkin, K., Ward, M. & Sherry, S. T. dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res. 28, 352–355 (2000).
doi: 10.1093/nar/28.1.352
pubmed: 10592272
pmcid: 102496
Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 39, 1181–1186 (2007).
doi: 10.1038/ng1007-1181
pubmed: 17898773
pmcid: 2031016
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature 590, 290–299 (2021).
doi: 10.1038/s41586-021-03205-y
pubmed: 33568819
pmcid: 7875770
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
doi: 10.1126/science.aaz1776
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
doi: 10.1093/nar/gky955
pubmed: 30357393
Lowy-Gallego, E. et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 genomes project. Wellcome Open Res. 4, 50 (2019).
doi: 10.12688/wellcomeopenres.15126.2
pubmed: 32175479
pmcid: 7059836
Salzberg, S. L. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 20, 92 (2019).
doi: 10.1186/s13059-019-1715-2
pubmed: 31097009
pmcid: 6521345
Gao, G. F. et al. Before and after: comparison of legacy and harmonized TCGA genomic data commons’ data. Cell Syst. 9, 24–34 (2019).
doi: 10.1016/j.cels.2019.06.006
pubmed: 31344359
pmcid: 6707074
Lansdon, L. A. et al. Factors affecting migration to GRCh38 in laboratories performing clinical next-generation sequencing. J. Mol. Diagn. 23, 651–657 (2021).
doi: 10.1016/j.jmoldx.2021.02.003
pubmed: 33631350
Fujita, P. A. et al. The UCSC genome browser database: update 2011. Nucleic Acids Res. 39, D876–D882 (2010).
doi: 10.1093/nar/gkq963
pubmed: 20959295
pmcid: 3242726
Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
doi: 10.1093/bioinformatics/btt730
pubmed: 24351709
Picard toolkit. GitHub https://broadinstitute.github.io/picard/ (2019).
Mun, T., Chen, N.-C. & Langmead, B. LevioSAM: fast lift-over of variant-aware reference alignments. Bioinformatics 37, 4243–4245 (2021).
Pan, B. et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics 20, 17–29 (2019).
Ormond, C., Ryan, N. M., Corvin, A. & Heron, E. A. Converting single nucleotide variants between genome builds: from cautionary tale to solution. Brief. Bioinform. 22, bbab069 (2021).
doi: 10.1093/bib/bbab069
pubmed: 33822888
pmcid: 8425424
Li, H. et al. Exome variant discrepancies due to reference genome differences. Am. J. Hum. Genet. 108, 1239–1250 (2021).
Lansdon, L. A. et al. Clinical validation of genome reference consortium human build 38 in a laboratory utilizing next-generation sequencing technologies. Clin. Chem. 68, 1177–1183 (2022).
doi: 10.1093/clinchem/hvac113
pubmed: 35869940
Behera, S. et al. FixItFelix: improving genomic analysis by fixing reference errors. Genome Biol. 24, 31 (2023).
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).
doi: 10.1093/bioinformatics/btaa1016
pubmed: 33320174
pmcid: 8289374
Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 22, 8 (2021).
doi: 10.1186/s13059-020-02229-3
pubmed: 33397413
pmcid: 7780692
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2022).
doi: 10.1016/j.xgen.2022.100128
pubmed: 36452119
pmcid: 9706577
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
doi: 10.1038/s41587-019-0217-9
pubmed: 31406327
pmcid: 6776680
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
Holtgrewe, M. Mason: A Read Simulator for Second Generation Sequencing Data. Report No. TR-B-10-06 (Technical Reports of Institut für Mathematik und Informatik, Freie Universität Berlin, 2010).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arXiv.1303.3997 (2013).
Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
doi: 10.1038/nbt.4060
pubmed: 29431738
pmcid: 5889714
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
doi: 10.1093/bioinformatics/bty191
pubmed: 29750242
pmcid: 6137996
Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19, 705–710 (2022).
Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
doi: 10.1038/s41586-022-05325-5
pubmed: 36261518
pmcid: 9668749
Smolka, M. et al. Comprehensive structural variant detection: from mosaic to population-level. Preprint at bioRxiv https://doi.org/10.1101/2022.04.04.487055 (2022).
English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).
doi: 10.1186/s13059-022-02840-6
pubmed: 36575487
pmcid: 9793516
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
doi: 10.1038/s41592-020-01056-5
pubmed: 33526886
pmcid: 7961889
Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet. Med. 18, 1282–1289 (2016).
doi: 10.1038/gim.2016.58
pubmed: 27228465
Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).
doi: 10.1093/bib/bbs017
pubmed: 22517427
Talenti, A. & Prendergast, J. nf-LO: a scalable, containerized workflow for genome-to-genome lift over. Genome Biol. Evol. 13, evab183 (2021).
doi: 10.1093/gbe/evab183
pubmed: 34383887
pmcid: 8412297
Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Bioinformatics 39, btac743 (2023).
doi: 10.1093/bioinformatics/btac743
pubmed: 36448683
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
doi: 10.1093/nar/27.11.2369
pubmed: 10325427
pmcid: 148804
Chin, C.-S. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat. Methods 20, 1213–1221 (2023).
doi: 10.1038/s41592-023-01914-y
pubmed: 37365340
pmcid: 10406601
Gog, S., Beller, T., Moffat, A. & Petri, M. From theory to practice: plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (eds. Gudmundsson, J. & Katajainen, J.) 326–337 (SEA, 2014).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
doi: 10.1093/bioinformatics/btw152
pubmed: 27153593
pmcid: 4937194
Rapid yaml. GitHub https://github.com/biojppm/rapidyaml (2022).
Bonfield, J. K. et al. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience 10, giab007 (2021).
doi: 10.1093/gigascience/giab007
pubmed: 33594436
pmcid: 7931820
Pockrandt, C., Alzamel, M., Iliopoulos, C. S. & Reinert, K. GenMap: ultra-fast computation of genome mappability. Bioinformatics 36, 3687–3692 (2020).
doi: 10.1093/bioinformatics/btaa222
pubmed: 32246826
pmcid: 7320602
Leitner-Ankerl, M. Robin hood unordered map and set. GitHub https://github.com/martinus/robin-hood-hashing (2022).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
doi: 10.1093/bioinformatics/btq033
pubmed: 20110278
pmcid: 2832824
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
doi: 10.1093/gigascience/giab008
pubmed: 33590861
pmcid: 7931819
Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).
Cook, D., Kolesnikov, A., Chang, P.-C. & Carroll, A. Improving variant calling using haplotype information. DeepVariant Blog https://google.github.io/deepvariant/posts/2021-02-08-the-haplotype-channel/ (2021).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
Gordon, A. GNU time. https://www.gnu.org/software/time/ (2018).
Chen, N.-C. leviosam2. Zenodo https://doi.org/10.5281/zenodo.8198490 (2023).
Chen, N.-C. levioSAM2-experiments v.0.1. Zenodo https://doi.org/10.5281/zenodo.8198541 (2023).