The genome assembly and annotation of the cricket Gryllus longicercus.
Journal
Scientific Data
ISSN: 2052-4463
Abbreviated title: Sci Data
Country: England
NLM ID: 101640192
Publication information
Publication date:
28 Jun 2024
History:
received: 16 Apr 2024
accepted: 19 Jun 2024
medline: 29 Jun 2024
pubmed: 29 Jun 2024
entrez: 28 Jun 2024
Status:
epublish
Abstract
The order Orthoptera includes insects such as grasshoppers, katydids, and crickets. Some of these species are important for ecosystem stability and pollination, and others serve as research organisms in fields such as neurobiology, ecology, and evolution. Crickets, with more than 2,400 described species, are emerging as novel model research organisms owing to their diversity, worldwide distribution, regeneration capacity, and characteristic acoustic communication. Here we report the first genome assembly and annotation of a New World cricket, Gryllus longicercus Weissman & Gray 2019. The genome assembly, generated by combining 44.54 Gb of PacBio long reads and 120.44 Gb of Illumina short reads, has a length of 1.85 Gb. The genome annotation yielded 19,715 transcripts from 14,789 gene models.
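As a rough reading aid, the figures in the abstract can be turned into approximate sequencing depth and transcripts per gene. The sketch below is not part of the published analysis. It assumes that "Gb" means gigabases and that the 1.85 Gb assembly length is a fair stand-in for genome size. The function and variable names are illustrative only.

```python
# Back-of-the-envelope ratios from the figures quoted in the abstract.
# Assumptions (not stated in the record): "Gb" = gigabases, and the
# assembly length is used as a proxy for the true genome size.

LONG_READ_GB = 44.54    # PacBio long reads
SHORT_READ_GB = 120.44  # Illumina short reads
ASSEMBLY_GB = 1.85      # final assembly length
TRANSCRIPTS = 19_715
GENE_MODELS = 14_789


def fold_coverage(read_gb: float, assembly_gb: float) -> float:
    """Average depth: total sequenced bases divided by assembly length."""
    return read_gb / assembly_gb


long_cov = fold_coverage(LONG_READ_GB, ASSEMBLY_GB)
short_cov = fold_coverage(SHORT_READ_GB, ASSEMBLY_GB)
transcripts_per_gene = TRANSCRIPTS / GENE_MODELS

print(f"Long-read depth:      ~{long_cov:.1f}x")
print(f"Short-read depth:     ~{short_cov:.1f}x")
print(f"Transcripts per gene: ~{transcripts_per_gene:.2f}")
```

These ratios are only approximate, because depth measured against the assembly differs from depth measured against the true genome size whenever the assembly is incomplete or contains duplicated regions.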
Identifiers
pubmed: 38942791
doi: 10.1038/s41597-024-03554-z
pii: 10.1038/s41597-024-03554-z
Publication types
Dataset
Journal Article
Languages
eng
Citation subsets
IM
Pagination
708
Copyright information
© 2024. The Author(s).