CAARS: comparative assembly and annotation of RNA-Seq data.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
01 07 2019
History:
received: 28 06 2017
revised: 13 09 2018
accepted: 16 11 2018
pubmed: 20 11 2018
medline: 12 6 2020
entrez: 20 11 2018
Status: ppublish

Abstract

RNA sequencing (RNA-Seq) is a widely used approach to obtain transcript sequences in non-model organisms, notably for performing comparative analyses. However, current bioinformatic pipelines do not take full advantage of pre-existing reference data in related species for improving RNA-Seq assembly, annotation and gene family reconstruction. We built an automated pipeline named CAARS to combine novel data from RNA-Seq experiments with existing multi-species gene family alignments. RNA-Seq reads are assembled into transcripts by both de novo and assisted assemblies. Then, CAARS incorporates transcripts into gene families, builds gene alignments and trees and uses phylogenetic information to classify the genes as orthologs and paralogs of existing genes. We used CAARS to assemble and annotate RNA-Seq data in rodents and fishes using distantly related genomes as reference, a difficult case for this kind of analysis. We showed that CAARS assemblies are more complete and accurate than those assembled by a standard pipeline consisting of de novo assembly coupled with annotation by sequence similarity on a guide species. In addition to annotated transcripts, CAARS provides gene family alignments and trees, annotated with orthology relationships, directly usable for downstream comparative analyses. CAARS is implemented in Python and OCaml and is freely available at https://github.com/carinerey/caars. Supplementary data are available at Bioinformatics online.

Identifiers

pubmed: 30452539
pii: 5191702
doi: 10.1093/bioinformatics/bty903
pmc: PMC6596894

Chemical substances

RNA 63231-63-0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

2199-2207

Copyright information

© The Author(s) 2018. Published by Oxford University Press.

Authors

Carine Rey (C)

UnivLyon, Université Claude Bernard Lyon 1, ENS de Lyon, CNRS UMR, INSERM U1210, LBMC, F-69007, Lyon, France.

Philippe Veber (P)

UnivLyon, Université Claude Bernard Lyon 1, CNRS, UMR, LBBE, F-69100, Villeurbanne, France.

Bastien Boussau (B)

UnivLyon, Université Claude Bernard Lyon 1, CNRS, UMR, LBBE, F-69100, Villeurbanne, France.

Marie Sémon (M)

UnivLyon, Université Claude Bernard Lyon 1, ENS de Lyon, CNRS UMR, INSERM U1210, LBMC, F-69007, Lyon, France.
