GALBA: Genome Annotation with Miniprot and AUGUSTUS.


Journal

bioRxiv : the preprint server for biology
Abbreviated title: bioRxiv
Country: United States
NLM ID: 101680187

Publication information

Publication date:
10 Apr 2023
History:
medline: 24 Apr 2023
pubmed: 24 Apr 2023
entrez: 24 Apr 2023
Status: epublish

Abstract

The Earth BioGenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a Docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Identifiers

pubmed: 37090650
doi: 10.1101/2023.04.10.536199
pmc: PMC10120627

Publication types

Preprint

Languages

eng

Grants

Agency: Wellcome Trust
Country: United Kingdom
Agency: NIGMS NIH HHS
ID: R01 GM128145
Country: United States
Agency: NHGRI NIH HHS
ID: R01 HG010040
Country: United States

Comments and corrections

Type: UpdateIn

Authors

Tomáš Brůna (T)

US Department of Energy Joint Genome Institute, Berkeley, CA 94720, USA.

Heng Li (H)

Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA & Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA.

Joseph Guhlin (J)

Genomics Aotearoa and Laboratory for Evolution and Development, Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9016, New Zealand.

Daniel Honsel (D)

Institute of Computer Science, University of Göttingen, 37077 Göttingen, Germany.

Steffen Herbold (S)

Faculty for Computer Science and Mathematics, University of Passau, 94032 Passau, Germany.

Mario Stanke (M)

Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany.

Natalia Nenasheva (N)

Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany.

Matthis Ebel (M)

Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany.

Lars Gabriel (L)

Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany.

Katharina J Hoff (KJ)

Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany.

MeSH classifications