Galba: genome annotation with miniprot and AUGUSTUS.

MeSH terms: Animals; Eukaryota; Eukaryotic Cells; Molecular Sequence Annotation; Transcriptome

Keywords: AUGUSTUS; Gene prediction; Miniprot; Protein coding gene

Journal

BMC Bioinformatics

ISSN: 1471-2105

Abbreviated title: BMC Bioinformatics

Country: England

NLM ID: 100965194

Publication information

Publication date:
31 Aug 2023

History:

received: 18 Apr 2023

accepted: 21 Aug 2023

medline: 4 Sep 2023

pubmed: 1 Sep 2023

entrez: 31 Aug 2023

Status: epublish

Abstract

The Earth BioGenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed, but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a Docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Identifiers

DOI: 10.1186/s12859-023-05449-z

PMID: 37653395

PMCID: PMC10472564

PII: 10.1186/s12859-023-05449-z

Publication types

Journal Article

Languages

eng

Pagination

327

Grants

Agency: NHGRI NIH HHS

ID: R01 HG010040

Country: United States

Agency: NIH HHS

ID: R01HG010040

Country: United States

Comments and corrections

Type: UpdateOf

Authors

Tomáš Brůna (T)

Heng Li (H)

Joseph Guhlin (J)

Daniel Honsel (D)

Steffen Herbold (S)

Mario Stanke (M)

Natalia Nenasheva (N)

Matthis Ebel (M)

Lars Gabriel (L)

Katharina J Hoff (KJ)
