Comparative Genome Annotation.


Journal

Methods in molecular biology (Clifton, N.J.)
ISSN: 1940-6029
Abbreviated title: Methods Mol Biol
Country: United States
NLM ID: 9214969

Publication information

Publication date:
2024
History:
medline: 31 5 2024
pubmed: 31 5 2024
entrez: 31 5 2024
Status: ppublish

Abstract

Newly sequenced genomes are being added to the tree of life at an unprecedentedly fast pace. A large proportion of these new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Subsequent studies often focus on the differences between the closely related species or strains, even though shared gene structures prevail. Here, we review methods for comparative structural genome annotation. The reviewed methods include classical approaches, such as the alignment of protein sequences or protein profiles against the genome, and comparative gene prediction methods that exploit a genome alignment to annotate either a single target genome or all input genomes simultaneously. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Furthermore, we provide practical advice on genome annotation in general.

Identifiers

pubmed: 38819560
doi: 10.1007/978-1-0716-3838-5_7

Publication types

Journal Article Review

Languages

eng

Citation subsets

IM

Pagination

165-187

Copyright information

© 2024. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.

Authors

Stefanie Nachtweide (S)

Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany.

Lars Romoth (L)

Greifswald, Germany.

Mario Stanke (M)

Institute for Mathematics and Computer Science, Greifswald, Germany. mario.stanke@uni-greifswald.de.
