Semi-automated assembly of high-quality diploid human reference genomes.

Humans; Chromosome Mapping / standards; Diploidy; Genome, Human / genetics; Haplotypes / genetics; High-Throughput Nucleotide Sequencing / methods; Sequence Analysis, DNA / methods; Reference Standards; Genomics / methods; Chromosomes, Human / genetics; Genetic Variation / genetics

Journal

Nature

ISSN: 1476-4687

Abbreviated title: Nature

Country: England

NLM ID: 0410462

Publication information

Publication date:
Nov 2022

History:

received: 2021-12-29

accepted: 2022-09-06

pubmed: 2022-10-20

medline: 2022-11-22

entrez: 2022-10-19

Status: ppublish

Abstract

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society

Identifiers

DOI: 10.1038/s41586-022-05325-5 PMID: 36261518 PMC: PMC9668749

pubmed: 36261518

doi: 10.1038/s41586-022-05325-5

pii: 10.1038/s41586-022-05325-5

pmc: PMC9668749

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

519-531

Grants

Agency: NHGRI NIH HHS

ID: R01 HG006677

Country: United States

Agency: NHGRI NIH HHS

ID: U01 HG010961

Country: United States

Agency: NIGMS NIH HHS

ID: R35 GM130151

Country: United States

Agency: Howard Hughes Medical Institute

Country: United States

Agency: NHGRI NIH HHS

ID: R01 HG010169

Country: United States

Agency: NHGRI NIH HHS

ID: U01 HG010971

Country: United States

Agency: NHGRI NIH HHS

ID: R01 HG002385

Country: United States

Agency: NHGRI NIH HHS

ID: R01 HG010040

Country: United States

Agency: NHGRI NIH HHS

ID: U41 HG010972

Country: United States

