De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers.


Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
01 05 2019
History:
received: 14 08 2018
revised: 21 12 2018
accepted: 09 03 2019
entrez: 12 5 2019
pubmed: 12 5 2019
medline: 24 12 2019
Status: ppublish

Abstract

In recent years, massively parallel complementary DNA sequencing (RNA sequencing [RNA-Seq]) has emerged as a fast, cost-effective, and robust technology for studying entire transcriptomes in a variety of ways. In particular, for non-model organisms and in the absence of an appropriate reference genome, RNA-Seq is used to reconstruct the transcriptome de novo. Although de novo transcriptome assembly of non-model organisms has been on the rise recently and new tools are frequently being developed, there is still a knowledge gap about which assembly software should be used to build a comprehensive de novo assembly. Here, we present a large-scale comparative study in which 10 de novo assembly tools are applied to 9 RNA-Seq data sets spanning different kingdoms of life. Overall, we built >200 single assemblies and evaluated their performance on a combination of 20 biology-based and reference-free metrics. Our study is accompanied by a comprehensive and extensible Electronic Supplement that summarizes all data sets, assembly execution instructions, and evaluation results. Trinity, SPAdes, and Trans-ABySS, followed by Bridger and SOAPdenovo-Trans, generally outperformed the other tools compared. Moreover, we observed species-specific differences in the performance of each assembler. No tool delivered the best results for all data sets. We recommend a careful choice and normalization of evaluation metrics to select the best assembly results as a critical step in the reconstruction of a comprehensive de novo transcriptome assembly.
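
The abstract does not specify how the evaluation metrics were normalized. As one possible illustration, the Python sketch below assumes min-max scaling of each metric across assemblers and ranks tools by the sum of their scaled scores. The function names, tool labels, metric names, and numbers are hypothetical placeholders, not methods or values taken from the study.

```python
from typing import Dict, FrozenSet, List, Tuple

Scores = Dict[str, Dict[str, float]]  # tool -> metric -> raw value


def min_max_normalize(scores: Scores,
                      lower_is_better: FrozenSet[str] = frozenset()) -> Scores:
    """Rescale every metric to [0, 1] across tools so that 1.0 is always best.

    Assumption: min-max scaling; the study may have used a different scheme.
    """
    metrics = list(next(iter(scores.values())).keys())
    normalized: Scores = {tool: {} for tool in scores}
    for metric in metrics:
        values = [scores[tool][metric] for tool in scores]
        lo, hi = min(values), max(values)
        span = hi - lo
        for tool in scores:
            if span == 0:
                # All tools tie on this metric, so none is penalized.
                scaled = 1.0
            else:
                scaled = (scores[tool][metric] - lo) / span
                if metric in lower_is_better:
                    scaled = 1.0 - scaled  # flip so that smaller raw values score higher
            normalized[tool][metric] = scaled
    return normalized


def rank_by_summed_score(normalized: Scores) -> List[Tuple[str, float]]:
    """Sum the scaled metrics per tool and sort from best to worst."""
    totals = {tool: sum(vals.values()) for tool, vals in normalized.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up example data (hypothetical tools and metrics, for illustration only).
example_scores: Scores = {
    "tool_A": {"mapping_rate_percent": 90, "misassemblies": 10},
    "tool_B": {"mapping_rate_percent": 80, "misassemblies": 30},
    "tool_C": {"mapping_rate_percent": 85, "misassemblies": 20},
}
example_ranking = rank_by_summed_score(
    min_max_normalize(example_scores, lower_is_better=frozenset({"misassemblies"}))
)
print(example_ranking)
```

Scaling each metric before summing keeps a metric with a large numeric range (such as a contig count) from dominating one with a small range (such as a percentage), which is one reason normalization matters when a single best assembly is chosen from many.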


Identifiers

pubmed: 31077315
pii: 5488105
doi: 10.1093/gigascience/giz039
pmc: PMC6511074

Publication types

Comparative Study; Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2019. Published by Oxford University Press.

Authors

Martin Hölzer (M)

RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany.
European Virus Bioinformatics Center, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany.

Manja Marz (M)

RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany.
European Virus Bioinformatics Center, Friedrich Schiller University, Leutragraben 1, 07743 Jena, Germany.
FLI Leibniz Institute for Age Research, Beutenbergstraße 11, 07743 Jena, Germany.
