Exploiting orthology and de novo transcriptome assembly to refine target sequence information.

Animals; Gene Expression Profiling / methods; Genomics; Humans; Sequence Analysis, RNA; Sequence Homology, Nucleic Acid

Comparative genomics; Orthology; RNA-Seq; Sequence refinement; de novo transcriptome assembly

Journal

BMC medical genomics

ISSN: 1755-8794

Abbreviated title: BMC Med Genomics

Country: England

NLM ID: 101319628

Publication information

Publication date:
2019-05-23

History:

received: 2018-11-07

accepted: 2019-05-08

entrez: 2019-05-25

pubmed: 2019-05-28

medline: 2019-12-31

Status: epublish

Abstract

BACKGROUND: The ability to generate recombinant drug target proteins is important for drug discovery research because it facilitates the investigation of drug-target interactions in vitro. To accomplish this, the target's exact protein sequence is required. Public databases, such as Ensembl, UniProt and RefSeq, are extensive protein and nucleotide sequence repositories. However, many sequences for non-human organisms are predicted by computational pipelines and may thus be incomplete or incorrect. Gaps or errors in orthologous drug target sequences could lead to misinterpreted experimental outcomes. Transcriptome analysis by RNA-Seq has been established as a standard method for gene expression analysis. Apart from this common application, paired-end RNA-Seq data can also be used to obtain full-coverage cDNA sequences via de novo transcriptome assembly.

METHODS: To assess whether de novo transcriptome assemblies can be used to determine a protein's sequence by searching the assembly for a known orthologous sequence, we generated 3 × 6 = 18 tissue-specific assemblies (three organs: brain, kidney and liver; six species: human, mouse, rat, dog, pig and cynomolgus monkey). These assemblies and the manually curated human protein sequences from UniProtKB/Swiss-Prot were used in a reciprocal BLAST search to identify best-matching hits. We automated and generalised our approach and present the a&o-tool, a workflow which exploits de novo assemblies of paired-end RNA-Seq data and orthology information for target sequence validation and refinement across related species. Furthermore, the a&o-tool extracts the sequences of the best hits from a reciprocal BLAST search, translates them into protein sequences, computes a multiple sequence alignment and quantifies the refinement.

RESULTS: For the three human assemblies we observed a hit rate greater than 60% with 100% sequence coverage and identity. For assemblies from the other species we observed similar hit rates and coverage, with the highest identities for cynomolgus monkey.

CONCLUSIONS: In summary, we show how to refine protein sequences using RNA-Seq data and sequence information from closely related species. With the a&o-tool we provide a fully automated pipeline to perform refinement, including cDNA translation and multiple sequence alignment for visual inspection. The major prerequisite for applying the a&o-tool is high-quality sequencing data.
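The reciprocal BLAST step described in the abstract can be illustrated with a minimal sketch. It is not the a&o-tool's implementation: the hit layout `(query_id, subject_id, bit_score)`, the choice of bit score as the ranking criterion, the tie-breaking rule, and the function names `best_hits`, `reciprocal_best_hits` and `hit_rate` are all assumptions made for illustration. The idea is that a reference protein and an assembled contig are paired only if each is the other's top-scoring hit.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical hit layout: (query_id, subject_id, bit_score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit],
                         reverse: Iterable[Hit]) -> Dict[str, str]:
    """Keep protein -> contig pairs whose best hits point at each other."""
    fwd = best_hits(forward)  # reference protein -> assembled contig
    rev = best_hits(reverse)  # assembled contig -> reference protein
    return {prot: contig for prot, contig in fwd.items()
            if rev.get(contig) == prot}


def hit_rate(pairs: Dict[str, str], n_reference: int) -> float:
    """Fraction of reference proteins that have a reciprocal best hit."""
    return len(pairs) / n_reference if n_reference else 0.0


# Toy data (invented identifiers and scores, not results from the paper).
forward = [("P1", "c1", 200.0), ("P1", "c2", 150.0),
           ("P2", "c2", 180.0), ("P3", "c3", 90.0)]
reverse = [("c1", "P1", 210.0), ("c2", "P1", 160.0),
           ("c2", "P2", 120.0), ("c3", "P3", 95.0)]

pairs = reciprocal_best_hits(forward, reverse)
print(pairs)               # {'P1': 'c1', 'P3': 'c3'}
print(hit_rate(pairs, 3))  # two of the three toy proteins are paired
```

In the toy data, protein P2 is dropped because its best contig, c2, scores higher against P1 in the reverse search. In practice the hit tuples would be parsed from the search tool's tabular output, and coverage and identity thresholds such as those reported in the results would be applied as additional filters.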

Identifiers

DOI: 10.1186/s12920-019-0524-5

PMID: 31122257

PMC: PMC6533699

PII: 10.1186/s12920-019-0524-5

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Authors

Julia F Söllner (JF)

Germán Leparc (G)

Matthias Zwick (M)

Tanja Schönberger (T)

Tobias Hildebrandt (T)

Kay Nieselt (K)

Eric Simon (E)
