Galba: genome annotation with miniprot and AUGUSTUS.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
31 Aug 2023
History:
received: 18 Apr 2023
accepted: 21 Aug 2023
medline: 4 Sep 2023
pubmed: 1 Sep 2023
entrez: 31 Aug 2023
Status: epublish

Abstract

The Earth BioGenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a Docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Identifiers

pubmed: 37653395
doi: 10.1186/s12859-023-05449-z
pii: 10.1186/s12859-023-05449-z
pmc: PMC10472564

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

327

Grants

Agency: NHGRI NIH HHS
ID: R01 HG010040
Country: United States
Agency: NIH HHS
ID: R01HG010040
Country: United States

Comments and corrections

Type: UpdateOf

Copyright information

© 2023. BioMed Central Ltd., part of Springer Nature.

Authors

Tomáš Brůna (T)

U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.

Heng Li (H)

Department of Data Sciences, Dana-Farber Cancer Institute, Boston, 02215, MA, USA.
Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA.

Joseph Guhlin (J)

Genomics Aotearoa and Laboratory for Evolution and Development, Department of Biochemistry, University of Otago, Dunedin, 9016, New Zealand.

Daniel Honsel (D)

Institute of Computer Science, University of Göttingen, 37077, Göttingen, Germany.

Steffen Herbold (S)

Faculty for Computer Science and Mathematics, University of Passau, 94032, Passau, Germany.

Mario Stanke (M)

Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany.

Natalia Nenasheva (N)

Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany.

Matthis Ebel (M)

Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany.

Lars Gabriel (L)

Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany.

Katharina J Hoff (KJ)

Institute of Mathematics and Computer Science, and Center for Functional Genomics of Microbes, University of Greifswald, 17489, Greifswald, Germany. katharina.hoff@uni-greifswald.de.
