TSEBRA: transcript selector for BRAKER.
Evidence integration
Gene prediction
Genome annotation
Protein homology
Protein-coding genes
RNA-seq
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
25 Nov 2021
History:
received: 04 Jun 2021
accepted: 15 Nov 2021
entrez: 26 Nov 2021
pubmed: 27 Nov 2021
medline: 30 Nov 2021
Status:
epublish
Abstract
BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on the provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and the statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data, while BRAKER2 uses only a database of cross-species proteins. So far, the BRAKER suite has not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and pick the output that is likely better. As a result, one of the two types of extrinsic evidence remains unexploited. We present TSEBRA, a software tool that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.
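The abstract describes the selection step only in general terms: overlapping transcripts are scored by their evidence support and compared under a set of rules. The Python sketch below illustrates that general idea and is not TSEBRA's actual implementation. The `Transcript` class, the intron-based `support_fraction` score, the `margin` threshold, and all coordinates are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical, minimal transcript record (not TSEBRA's data model)."""
    tx_id: str
    start: int
    end: int
    introns: tuple  # tuple of (start, end) intron coordinates


def overlaps(a, b):
    """True if the genomic spans of two transcripts share at least one base."""
    return a.start <= b.end and b.start <= a.end


def support_fraction(tx, supported_introns):
    """Fraction of a transcript's introns found in the evidence set.

    Assumption for this sketch: single-exon transcripts score 0.0.
    """
    if not tx.introns:
        return 0.0
    hits = sum(1 for intron in tx.introns if intron in supported_introns)
    return hits / len(tx.introns)


def select_transcripts(set1, set2, supported_introns, margin=0.2):
    """Combine two prediction sets with a simple pairwise rule.

    For every overlapping pair (one transcript from each set), the
    transcript whose score is lower by more than `margin` is rejected.
    Pairs with similar scores are both kept, as are transcripts that
    overlap nothing in the other set.
    """
    candidates = list(set1) + list(set2)
    scores = {tx.tx_id: support_fraction(tx, supported_introns)
              for tx in candidates}
    rejected = set()
    for a in set1:
        for b in set2:
            if not overlaps(a, b):
                continue
            diff = scores[a.tx_id] - scores[b.tx_id]
            if diff > margin:
                rejected.add(b.tx_id)
            elif diff < -margin:
                rejected.add(a.tx_id)
    return [tx for tx in candidates if tx.tx_id not in rejected]


# Made-up example: introns supported by RNA-seq or protein evidence.
evidence = {(100, 200), (300, 400)}
braker1 = [Transcript("b1.t1", 50, 500, ((100, 200), (300, 400)))]
braker2 = [
    Transcript("b2.t1", 60, 480, ((100, 200), (310, 400))),
    Transcript("b2.t2", 1000, 1500, ((1100, 1200),)),
]
selected = select_transcripts(braker1, braker2, evidence)
print([tx.tx_id for tx in selected])
```

In this toy input, `b2.t1` is dropped because only half of its introns are supported while the overlapping `b1.t1` is fully supported. `b2.t2` is kept because it has no overlapping competitor, even though it has no support. A real combiner would likely apply additional filters to such transcripts.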
Identifiers
pubmed: 34823473
doi: 10.1186/s12859-021-04482-0
pii: 10.1186/s12859-021-04482-0
pmc: PMC8620231
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
566
Grants
Agency: NIGMS NIH HHS
ID: R01 GM128145
Country: United States
Agency: NIH HHS
ID: GM128145
Country: United States
Copyright information
© 2021. The Author(s).