CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure.


Journal

PLoS computational biology
ISSN: 1553-7358
Abbreviated title: PLoS Comput Biol
Country: United States
NLM ID: 101238922

Publication information

Publication date:
11 2021
History:
received: 23 06 2021
accepted: 11 11 2021
revised: 07 12 2021
pubmed: 24 11 2021
medline: 31 12 2021
entrez: 23 11 2021
Status: epublish

Abstract

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn graph-based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of the underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species: Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes, the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs and representing two adult D. melanogaster whole-body samples, were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools.
Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
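
The idea of labelling contigs by the complexity of the graph they come from can be illustrated with a minimal Python sketch. This is not CStone's implementation: the abstract does not define the three levels, so the function names, the branch-counting rule and the `max_simple_branches` threshold below are all assumptions made for illustration only. The sketch builds a de Bruijn graph from reads, splits it into connected components, and gives each component a hypothetical level from 1 (no ambiguous paths) to 3 (many branching nodes).

```python
def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
            graph.setdefault(kmer[1:], set())  # keep sink nodes as keys
    return graph


def weak_components(graph):
    """Group nodes into weakly connected components (edge direction ignored)."""
    undirected = {node: set() for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            undirected[src].add(dst)
            undirected[dst].add(src)
    seen, components = set(), []
    for start in undirected:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in undirected[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def classify(graph, component, max_simple_branches=2):
    """Return a hypothetical complexity level (1, 2 or 3) for one component.

    Assumed rule, not taken from the paper: a node is a branch point if it
    has more than one successor or more than one predecessor.
    """
    in_degree = {node: 0 for node in component}
    for node in component:
        for dst in graph[node]:
            in_degree[dst] += 1
    branches = sum(
        1 for node in component
        if len(graph[node]) > 1 or in_degree[node] > 1
    )
    if branches == 0:
        return 1  # single unambiguous path
    if branches <= max_simple_branches:
        return 2  # a few alternative paths, e.g. one bubble
    return 3      # many alternative paths


# Toy usage: one linear read set and one with a single-base difference.
linear = build_de_bruijn(["ACGTTGCA"], k=4)
bubble = build_de_bruijn(["ACGTTGCA", "ACGTAGCA"], k=4)
linear_levels = [classify(linear, c) for c in weak_components(linear)]
bubble_levels = [classify(bubble, c) for c in weak_components(bubble)]
print(linear_levels, bubble_levels)
```

In this toy input the second read set differs from the first at one base, which creates two alternative paths between the same start and end nodes. A contig drawn from such a component could be a chimera of the two paths, which is the kind of ambiguity a per-contig label is meant to flag.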

Identifiers

pubmed: 34813594
doi: 10.1371/journal.pcbi.1009631
pii: PCOMPBIOL-D-21-01174
pmc: PMC8651127

Chemical substances

DNA, Complementary 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

e1009631

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

Raquel Linheiro (R)

CIBIO/InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal.

John Archer (J)

CIBIO/InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal.
