Telomere-to-telomere genome assembly of sorghum.

Sorghum / genetics; Genome, Plant; Telomere / genetics

Journal

Scientific data

ISSN: 2052-4463

Abbreviated title: Sci Data

Country: England

NLM ID: 101640192

Publication information

Publication date:
02 Aug 2024

History:

received: 28 Nov 2023

accepted: 19 Jul 2024

medline: 03 Aug 2024

pubmed: 03 Aug 2024

entrez: 02 Aug 2024

Status: epublish

Abstract

"Cuohu Bazi" (CHBZ) is an ancient sorghum variety collected from the fields of China, known for agronomic traits such as dwarf stature and early maturation. In this study, we present the first telomere-to-telomere (T2T), gap-free genome assembly of CHBZ, built from PacBio HiFi reads, Oxford Nanopore Technologies reads, and Hi-C data. The assembled genome comprises 724.85 Mb and resolves all 3,913 gaps present in the previous sorghum BTx623 reference genome. Notably, the T2T assembly captures 10 centromeres and all 20 telomeres, providing strong support for their integrity. The assembly is of high quality in terms of contiguity (contig N50: 71.1 Mb), completeness (BUSCO score: 99.01%, k-mer completeness: 98.88%), and correctness (QV: 61.60). Repetitive sequences account for 70.41% of the genome, and a total of 32,855 protein-coding genes were annotated. Furthermore, 161 CHBZ-specific presence/absence variant genes were identified in comparison with the BTx623 genome. This study provides valuable insights for future research on sorghum genetics, genomics, and evolutionary history.
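The abstract quotes a contig N50 and a QV without defining them. As a rough illustration only, the sketch below uses the standard definitions: N50 is the contig length at which the running total of contigs, sorted longest first, reaches half the assembly size, and QV is a Phred-scaled consensus quality, so the per-base error probability is 10^(-QV/10). The function names and the toy contig lengths are hypothetical and are not taken from the CHBZ assembly or its pipeline.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Defensive fallback; not reached for non-negative lengths.
    return ordered[-1]


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Hypothetical toy contig lengths (NOT from the CHBZ assembly).
toy_lengths = [80, 70, 50, 40, 30, 20]

# Total is 290, half is 145; 80 + 70 = 150 reaches it, so N50 is 70.
print(contig_n50(toy_lengths))

# The QV of 61.60 reported in the abstract, expressed as an error probability.
print(f"{qv_to_error_rate(61.60):.3e}")
```

Under this definition, the reported QV of 61.60 corresponds to roughly 7 expected errors per 10 million bases.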

Identifiers

DOI: 10.1038/s41597-024-03664-8

PMID: 39095379

pii: 10.1038/s41597-024-03664-8

Publication types

Dataset; Journal Article

Languages

eng

Pagination

835

Authors

Meng Li

Chunhai Chen

Haigang Wang

Huibin Qin

Sen Hou

Xukui Yang

Jianbo Jian

Peng Gao

Minxuan Liu

Zhixin Mu
