Galba: genome annotation with miniprot and AUGUSTUS.
AUGUSTUS
Gene prediction
Miniprot
Protein coding gene
Journal
BMC Bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
31 Aug 2023
History:
received: 18 Apr 2023
accepted: 21 Aug 2023
medline: 4 Sep 2023
pubmed: 1 Sep 2023
entrez: 31 Aug 2023
Status:
epublish
Abstract
The Earth BioGenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed, but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a Docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
Identifiers
pubmed: 37653395
doi: 10.1186/s12859-023-05449-z
pii: 10.1186/s12859-023-05449-z
pmc: PMC10472564
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
327
Grants
Agency: NHGRI NIH HHS
ID: R01 HG010040
Country: United States
Agency: NIH HHS
ID: R01HG010040
Country: United States
Comments and corrections
Type: UpdateOf
Copyright information
© 2023. BioMed Central Ltd., part of Springer Nature.