Telomere-to-telomere genome assembly of sorghum.


Journal

Scientific data
ISSN: 2052-4463
Abbreviated title: Sci Data
Country: England
NLM ID: 101640192

Publication information

Publication date:
02 Aug 2024
History:
received: 28 Nov 2023
accepted: 19 Jul 2024
medline: 3 Aug 2024
pubmed: 3 Aug 2024
entrez: 2 Aug 2024
Status: epublish

Abstract

"Cuohu Bazi" (CHBZ) is an ancient sorghum variety collected from the fields of China, known for agronomic traits such as dwarf stature and early maturation. In this study, we present the first telomere-to-telomere (T2T), gap-free genome assembly of CHBZ, built from PacBio HiFi reads, Oxford Nanopore Technologies (ONT) reads, and Hi-C data. The assembled genome comprises 724.85 Mb and resolves all 3,913 gaps present in the previous sorghum BTx623 reference genome. Notably, the T2T assembly captures 10 centromeres and all 20 telomeres, which supports the integrity of the assembled chromosomes. The assembly is of high quality in terms of contiguity (contig N50: 71.1 Mb), completeness (BUSCO score: 99.01%; k-mer completeness: 98.88%), and correctness (QV: 61.60). Repetitive sequences account for 70.41% of the genome, and a total of 32,855 protein-coding genes have been annotated. Furthermore, 161 CHBZ-specific presence/absence variant genes were identified relative to the BTx623 genome. This study provides a valuable resource for future research on sorghum genetics, genomics, and evolutionary history.
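
For readers unfamiliar with the quality metrics quoted in the abstract, the sketch below illustrates two of them under their usual definitions: contig N50 (the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the assembly) and the Phred-scaled consensus quality value (QV = -10 log10 of the per-base error probability). It is not the authors' pipeline. The function names and the toy contig lengths are hypothetical; only the QV of 61.60 is taken from the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half
        # of the assembly size defines the N50.
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Hypothetical toy contig lengths, not data from the study.
toy_contigs = [80, 70, 50, 30, 20]
print("toy N50:", n50(toy_contigs))

# QV reported in the abstract.
error_rate = qv_to_error_rate(61.60)
print(f"per-base error probability at QV 61.60: {error_rate:.2e}")
print(f"roughly one error per {1 / error_rate / 1e6:.2f} Mb")
```

Under this definition, a QV of 61.60 corresponds to a per-base error probability of about 7 in 10 million.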

Identifiers

pubmed: 39095379
doi: 10.1038/s41597-024-03664-8
pii: 10.1038/s41597-024-03664-8

Publication types

Dataset Journal Article

Languages

eng

Citation subsets

IM

Pagination

835

Copyright information

© 2024. The Author(s).

Authors

Meng Li (M)

Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China. nkypzslm@163.com.

Chunhai Chen (C)

BGI Genomics, Shenzhen, 518083, China.

Haigang Wang (H)

Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.

Huibin Qin (H)

Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.

Sen Hou (S)

Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.

Xukui Yang (X)

BGI Genomics, Shenzhen, 518083, China.

Jianbo Jian (J)

BGI Genomics, Shenzhen, 518083, China.

Peng Gao (P)

BGI, Shenzhen, 518083, China. gaopeng@genomics.cn.

Minxuan Liu (M)

Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China. liuminxuan@caas.cn.

Zhixin Mu (Z)

Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China. muzx2008@sina.com.
