Telomere-to-telomere genome assembly of sorghum.
Journal
Scientific data
ISSN: 2052-4463
Abbreviated title: Sci Data
Country: England
NLM ID: 101640192
Publication information
Publication date:
02 Aug 2024
History:
received: 28 Nov 2023
accepted: 19 Jul 2024
medline: 03 Aug 2024
pubmed: 03 Aug 2024
entrez: 02 Aug 2024
Status:
epublish
Abstract
"Cuohu Bazi" (CHBZ) is an ancient sorghum variety collected from the fields of China, known for agronomic traits such as dwarf stature and early maturation. In this study, we present the first telomere-to-telomere (T2T), gap-free genome assembly of CHBZ, built from PacBio HiFi reads, Oxford Nanopore Technologies reads, and Hi-C data. The assembled genome comprises 724.85 Mb and resolves all 3,913 gaps that were present in the previous sorghum BTx623 reference genome. Notably, the T2T assembly captures 10 centromeres and all 20 telomeres, providing strong support for their integrity. The assembly is of high quality in terms of contiguity (contig N50: 71.1 Mb), completeness (BUSCO score: 99.01%; k-mer completeness: 98.88%), and correctness (QV: 61.60). Repetitive sequences account for 70.41% of the genome, and a total of 32,855 protein-coding genes have been annotated. Furthermore, 161 CHBZ-specific presence/absence variant genes were identified in comparison with the BTx623 genome. This study provides valuable insights for future research on sorghum genetics, genomics, and evolutionary history.
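The contiguity and correctness figures quoted in the abstract follow standard definitions. As a minimal sketch (not the authors' pipeline; the function names and the toy contig lengths are illustrative assumptions), contig N50 is the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly size, and a Phred-scaled QV corresponds to a per-base error probability of 10^(-QV/10):

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def qv_to_error_rate(qv):
    """Convert a Phred-scaled consensus quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Toy contig lengths (hypothetical, not taken from the CHBZ assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = contig_n50(toy_lengths)          # 70: 80 + 70 reaches half of the 300 total
error_rate = qv_to_error_rate(61.60)   # QV value as reported in the abstract
```

Under this definition, the reported QV of 61.60 corresponds to roughly seven errors per ten million bases.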
Identifiers
pubmed: 39095379
doi: 10.1038/s41597-024-03664-8
pii: 10.1038/s41597-024-03664-8
Publication types
Dataset
Journal Article
Languages
eng
Citation subsets
IM
Pagination
835
Copyright information
© 2024. The Author(s).