Semi-automated assembly of high-quality diploid human reference genomes.
Journal
Nature
ISSN: 1476-4687
Titre abrégé: Nature
Pays: England
ID NLM: 0410462
Informations de publication
Date de publication:
Nov 2022
Nov 2022
Historique:
received:
29
12
2021
accepted:
06
09
2022
pubmed:
20
10
2022
medline:
22
11
2022
entrez:
19
10
2022
Statut:
ppublish
Résumé
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society
Identifiants
pubmed: 36261518
doi: 10.1038/s41586-022-05325-5
pii: 10.1038/s41586-022-05325-5
pmc: PMC9668749
doi:
Types de publication
Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Langues
eng
Sous-ensembles de citation
IM
Pagination
519-531Subventions
Organisme : NHGRI NIH HHS
ID : R01 HG006677
Pays : United States
Organisme : NHGRI NIH HHS
ID : U01 HG010961
Pays : United States
Organisme : NIGMS NIH HHS
ID : R35 GM130151
Pays : United States
Organisme : Howard Hughes Medical Institute
Pays : United States
Organisme : NHGRI NIH HHS
ID : R01 HG010169
Pays : United States
Organisme : NHGRI NIH HHS
ID : U01 HG010971
Pays : United States
Organisme : NHGRI NIH HHS
ID : R01 HG002385
Pays : United States
Organisme : NHGRI NIH HHS
ID : R01 HG010040
Pays : United States
Organisme : NHGRI NIH HHS
ID : U41 HG010972
Pays : United States
Informations de copyright
© 2022. The Author(s).
Références
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
pubmed: 11237011
doi: 10.1038/35057062
Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
pubmed: 28396521
pmcid: 5411779
doi: 10.1101/gr.213611.116
Sherman, R. M. & Salzberg, S. L. Pan-genomics in the human genome era. Nat. Rev. Genet. 21, 243–254 (2020).
pubmed: 32034321
pmcid: 7752153
doi: 10.1038/s41576-020-0210-7
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
pubmed: 32504078
pmcid: 7877196
doi: 10.1038/s41576-020-0236-x
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
pubmed: 35357919
pmcid: 9186530
doi: 10.1126/science.abj6987
Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
pubmed: 35444317
pmcid: 9402379
doi: 10.1038/s41586-022-04601-8
Giani, A. M., Gallo, G. R., Gianfranceschi, L. & Formenti, G. Long walk to genomics: history and current approaches to genome sequencing and assembly. Comput. Struct. Biotechnol. J. 18, 9–19 (2020).
pubmed: 31890139
doi: 10.1016/j.csbj.2019.11.002
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
doi: 10.1038/nature15393
Ko, B. J. et al. Widespread false gene gains caused by duplication errors in genome assemblies. Genome Biol. https://doi.org/10.1186/s13059-022-02764-1 (2022).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
pubmed: 33632895
pmcid: 8026704
doi: 10.1126/science.abf7117
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
pubmed: 33911273
pmcid: 8081667
doi: 10.1038/s41586-021-03451-0
Kim, J. et al. False gene and chromosome losses in genome assemblies caused by GC content variation and repeats. Genome Biol. https://doi.org/10.1186/s13059-022-02765-0 (2022).
Cheng, Y., Berg, A., Wu, S., Li, Y. & Wu, R. Computing genetic imprinting expressed by haplotypes. Methods Mol. Biol. 573, 189–212 (2009).
pubmed: 19763929
doi: 10.1007/978-1-60761-247-6_11
Bailey-Wilson, J. E. & Wilson, A. F. Linkage analysis in the next-generation sequencing era. Hum. Hered. 72, 228–236 (2011).
pubmed: 22189465
pmcid: 3267991
doi: 10.1159/000334381
Li, Q. et al. Haplotyping by linked-read sequencing (HLRS) of the genetic disease carriers for preimplantation genetic testing without a proband or relatives. BMC Med. Genomics 13, 117 (2020).
pubmed: 32819358
pmcid: 7441613
doi: 10.1186/s12920-020-00766-1
Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30 (2020).
pubmed: 32033565
pmcid: 7006217
doi: 10.1186/s13059-020-1935-5
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
pubmed: 32801147
pmcid: 7545148
doi: 10.1101/gr.263566.120
Ghurye, J. et al. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput. Biol. 15, e1007273 (2019).
pubmed: 31433799
pmcid: 6719893
doi: 10.1371/journal.pcbi.1007273
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
pubmed: 27467250
pmcid: 5596920
doi: 10.1016/j.cels.2015.07.012
Bocklandt, S., Hastie, A. & Cao, H. Bionano genome mapping: high-throughput, ultra-long molecule genome analysis system for precision genome assembly and haploid-resolved structural variation discovery. Adv. Exp. Med. Biol. 1129, 97–118 (2019).
pubmed: 30968363
doi: 10.1007/978-981-13-6037-4_7
Chin, C. S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
pubmed: 27749838
pmcid: 5503144
doi: 10.1038/nmeth.4035
Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).
doi: 10.1038/nbt.4277
Kronenberg, Z. N. et al. Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C. Nat. Commun. 12, 1935 (2021).
pubmed: 33911078
pmcid: 8081726
doi: 10.1038/s41467-020-20536-y
Ball, M. P. et al. A public resource facilitating clinical use of genomes. Proc. Natl Acad. Sci. USA 109, 11920–11927 (2012).
pubmed: 22797899
pmcid: 3409785
doi: 10.1073/pnas.1201904109
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
pubmed: 27271295
pmcid: 4896128
doi: 10.1038/sdata.2016.25
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genomics. 2, 100128 (2022).
Chaisson, M. J. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).
pubmed: 25383537
doi: 10.1038/nature13907
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
pubmed: 32686750
pmcid: 7483855
doi: 10.1038/s41587-020-0503-6
Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).
pubmed: 33288905
doi: 10.1038/s41587-020-0711-0
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
pubmed: 30936562
doi: 10.1038/s41587-019-0072-8
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
pubmed: 33526886
pmcid: 7961889
doi: 10.1038/s41592-020-01056-5
Zimin, A. V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669–2677 (2013).
pubmed: 23990416
pmcid: 3799473
doi: 10.1093/bioinformatics/btt476
Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat. Commun. 12, 60 (2021).
pubmed: 33397900
pmcid: 7782737
doi: 10.1038/s41467-020-20236-7
Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).
Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).
pubmed: 31819265
doi: 10.1038/s41592-019-0669-3
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
pubmed: 32928274
pmcid: 7488777
doi: 10.1186/s13059-020-02134-9
Formenti, G. et al. Complete vertebrate mitogenomes reveal widespread repeats and gene duplications. Genome Biol. 22, 120 (2021).
pubmed: 33910595
pmcid: 8082918
doi: 10.1186/s13059-021-02336-9
Silkaitis, K. & Lemos, B. Sex-biased chromatin and regulatory cross-talk between sex chromosomes, autosomes, and mitochondria. Biol. Sex Differ. 5, 2 (2014).
pubmed: 24422881
pmcid: 3907150
doi: 10.1186/2042-6410-5-2
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, 6588 (2022).
Howe, K. et al. Significantly improving the quality of genome assemblies through curation. Gigascience 10, giaa153 (2021).
pubmed: 33420778
pmcid: 7794651
doi: 10.1093/gigascience/giaa153
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
pubmed: 30013044
pmcid: 6341484
doi: 10.1038/s41592-018-0054-7
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
pubmed: 30858580
pmcid: 6699627
doi: 10.1038/s41587-019-0054-x
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
Hui, J., Shomorony, I., Ramchandran, K. & Courtade, T. A. Overlap-based genome assembly from variable-length reads. In 2016 IEEE International Symposium on Information Theory (ISIT) 1018–1022 (IEEE, 2016).
Formenti, G. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat. Methods 19, 696–704 (2022).
pubmed: 35361932
doi: 10.1038/s41592-022-01445-y
Olson, N. D. et al. Precision FDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions. Cell Genom. https://doi.org/10.1016/j.xgen.2022.100129 (2022).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
pubmed: 32541955
pmcid: 8454654
doi: 10.1038/s41587-020-0538-8
Yang, C. et al. Evolutionary and biomedical insights from a marmoset diploid genome assembly. Nature 594, 227–233 (2021).
pubmed: 33910227
pmcid: 8189906
doi: 10.1038/s41586-021-03535-x
Samuels, D. C. et al. Heterozygosity ratio, a robust global genomic measure of autozygosity and its association with height and disease risk. Genetics 204, 893–904 (2016).
pubmed: 27585849
pmcid: 5105867
doi: 10.1534/genetics.116.189936
Handsaker, R. E. et al. Large multiallelic copy number variations in humans. Nat. Genet. 47, 296–303 (2015).
pubmed: 25621458
pmcid: 4405206
doi: 10.1038/ng.3200
Bosch, N. et al. Characterization and evolution of the novel gene family FAM90A in primates originated by multiple duplication and rearrangement events. Hum. Mol. Genet. 16, 2572–2582 (2007).
pubmed: 17684299
doi: 10.1093/hmg/ddm209
Cantsilieris, S. et al. An evolutionary driver of interspersed segmental duplications in primates. Genome Biol. 21, 202 (2020).
pubmed: 32778141
pmcid: 7419210
doi: 10.1186/s13059-020-02074-4
Ju, X.-C. et al. The hominoid-specific gene TBC1D3 promotes generation of basal neural progenitors and induces cortical folding in mice. eLife 5, e18197 (2016).
pubmed: 27504805
pmcid: 5028191
doi: 10.7554/eLife.18197
Wu, Z. et al. Copy number variation of the lipoprotein(a) (LPA) gene is associated with coronary artery disease in a southern Han Chinese population. Int. J. Clin. Exp. Med. 7, 3669–3677 (2014).
pubmed: 25419416
pmcid: 4238520
McBride, C. S. Rapid evolution of smell and taste receptor genes during host specialization in Drosophila sechellia. Proc. Natl Acad. Sci. USA 104, 4996–5001 (2007).
pubmed: 17360391
pmcid: 1829253
doi: 10.1073/pnas.0608424104
Jalili, V. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update. Nucleic Acids Res. 48, W395–W402 (2020).
pubmed: 32479607
pmcid: 7319590
doi: 10.1093/nar/gkaa434
Liao, W.-W. et al. A draft human pangenome reference. Preprint at bioRxiv https://doi.org/10.1101/2022.07.09.499321 (2022).
Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).
pubmed: 33288906
doi: 10.1038/s41587-020-0719-5
Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat. Biotechnol.40, 1332–1335 (2022).
Shumate, A. et al. Assembly and annotation of an Ashkenazi human reference genome. Genome Biol. 21, 129 (2020).
pubmed: 32487205
pmcid: 7265644
doi: 10.1186/s13059-020-02047-7
Garg, S. Computational methods for chromosome-scale haplotype reconstruction. Genome Biol. 22, 101 (2021).
pubmed: 33845884
pmcid: 8040228
doi: 10.1186/s13059-021-02328-9
Rautiainen, M. et al. Verkko: telomere-to-telomere assembly of diploid chromosomes. Preprint at bioRxiv https://doi.org/10.1101/2022.06.24.497523 (2022).
Chen, Z. et al. Ultralow-input single-tube linked-read library method enables short-read second-generation sequencing systems to routinely generate highly accurate and economical long-range sequencing information. Genome Res. 30, 898–909 (2020).
pubmed: 32540955
pmcid: 7370886
doi: 10.1101/gr.260380.119
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
pubmed: 29431738
pmcid: 5889714
doi: 10.1038/nbt.4060
Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).
pubmed: 31971576
pmcid: 7203741
doi: 10.1093/bioinformatics/btaa025
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
pubmed: 30247488
doi: 10.1038/nbt.4235
Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
pubmed: 27940952
pmcid: 5411775
doi: 10.1101/gr.213462.116
Porubsky, D. et al. Dense and accurate whole-chromosome haplotyping of individual genomes. Nat. Commun. 8, 1293 (2017).
pubmed: 29101320
pmcid: 5670131
doi: 10.1038/s41467-017-01389-4
Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).
pubmed: 23042453
pmcid: 3580294
doi: 10.1038/nmeth.2206
Ghareghani, M. et al. Strand-seq enables reliable separation of long reads by chromosome via expectation maximization. Bioinformatics 34, i115–i123 (2018).
pubmed: 29949971
pmcid: 6022540
doi: 10.1093/bioinformatics/bty290
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
pubmed: 28100585
pmcid: 5411768
doi: 10.1101/gr.214270.116
Kirsche, M. et al. Jasmine: population-scale structural variant comparison and analysis. Preprint at bioRxiv https://doi.org/10.1101/2021.05.27.445886 (2021).
Rozowsky, J. et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol. Syst. Biol. 7, 522 (2011).
pubmed: 21811232
pmcid: 3208341
doi: 10.1038/msb.2011.54
Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).
pubmed: 30559433
doi: 10.1038/s41592-018-0236-3
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
pubmed: 20080505
pmcid: 2828108
doi: 10.1093/bioinformatics/btp698
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
pubmed: 19505943
pmcid: 2723002
doi: 10.1093/bioinformatics/btp352
Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: fast processing of NGS alignment formats. Bioinformatics 31, 2032–2034 (2015).
pubmed: 25697820
pmcid: 4765878
doi: 10.1093/bioinformatics/btv098
Porubsky, D. et al. breakpointR: an R/Bioconductor package to localize strand state changes in Strand-seq data. Bioinformatics 36, 1260–1261 (2020).
pubmed: 31504176
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
pubmed: 29750242
pmcid: 6137996
doi: 10.1093/bioinformatics/bty191
Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 4794 (2020).
pubmed: 32963235
pmcid: 7508831
doi: 10.1038/s41467-020-18564-9
Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
pubmed: 22908215
doi: 10.1093/bioinformatics/bts480
Waterhouse, R. M. et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018).
pubmed: 29220515
doi: 10.1093/molbev/msx319
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
pubmed: 26553804
doi: 10.1093/nar/gkv1189
Smit, A. F. A., Hubley, R. & Green, P. Repeatmasker Open 3.0 (Institute of Systems Biology, 1996).
Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics 22, 134–141 (2006).
pubmed: 16287941
doi: 10.1093/bioinformatics/bti774
Kapustin, Y., Souvorov, A., Tatusova, T. & Lipman, D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct 3, 20 (2008).
pubmed: 18495041
pmcid: 2440734
doi: 10.1186/1745-6150-3-20
Brown, G. R. et al. Gene: a gene-centered information resource at NCBI. Nucleic Acids Res. 43, D36–D42 (2015).
pubmed: 25355515
doi: 10.1093/nar/gku1055
Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Preprint at bioRxiv https://doi.org/10.1101/2022.02.14.480413 (2022).
Guarracino, A., Heumos, S., Nahnsen, S., Prins, P. & Garrison, E. ODGI: understanding pangenome graphs. Bioinformatics 38, 3319–3326 (2022).
pubmed: 35552372
pmcid: 9237687
doi: 10.1093/bioinformatics/btac308
Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).
pubmed: 31842948
pmcid: 6913012
doi: 10.1186/s13059-019-1911-0