Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields.

MeSH terms: Algorithms; Computational Biology / methods; Genomics / methods; High-Throughput Nucleotide Sequencing / methods; Humans

Keywords: De Bruijn graphs; Next-generation sequencing; Probabilistic graphical models

Journal

BMC Bioinformatics

ISSN: 1471-2105

Abbreviated title: BMC Bioinformatics

Country: England

NLM ID: 100965194

Publication information

Publication date: 14 Sep 2020

History:

received: 18 Jun 2020

accepted: 04 Sep 2020

entrez: 15 Sep 2020

pubmed: 16 Sep 2020

medline: 21 Oct 2020

Status: epublish

Abstract

BACKGROUND: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences make it difficult to identify the true underlying sequence. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes and arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node and arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data.

RESULTS: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.

CONCLUSIONS: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F
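To make the terms in the abstract concrete, the following Python sketch counts k-mer (node) and (k+1)-mer (arc) coverage from a few toy reads, guesses a multiplicity from each coverage value in isolation, and then checks whether an assignment is consistent with the arcs around a node. It is an illustration under stated assumptions, not the CRF model of the paper: the function names, the Poisson coverage model, the parameters `lam`, `err_mean` and `max_mult`, and the flow-conservation check used as a stand-in for "information of surrounding nodes and arcs" are all hypothetical choices made here. Reverse complements are ignored for brevity.

```python
from collections import Counter
from math import lgamma, log


def kmer_counts(reads, k):
    """Count every k-mer occurrence across the reads (coverage per k-mer)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def poisson_logpmf(x, mean):
    """Log-probability of observing x under a Poisson with the given mean."""
    return x * log(mean) - mean - lgamma(x + 1)


def independent_multiplicity(coverage, lam, err_mean=1.0, max_mult=4):
    """Most likely multiplicity judged from one coverage value alone.

    Assumption: multiplicity m > 0 yields Poisson(m * lam) coverage, and
    multiplicity 0 (a sequencing error) yields Poisson(err_mean) coverage.
    """
    best, best_score = 0, poisson_logpmf(coverage, err_mean)
    for m in range(1, max_mult + 1):
        score = poisson_logpmf(coverage, m * lam)
        if score > best_score:
            best, best_score = m, score
    return best


def flow_is_conserved(node, node_mult, arc_mult):
    """Check that incoming and outgoing arc multiplicities each sum to the
    node multiplicity. A side with no arcs (a sequence end) is skipped."""
    incoming = [m for arc, m in arc_mult.items() if arc[1:] == node]
    outgoing = [m for arc, m in arc_mult.items() if arc[:-1] == node]
    ok_in = not incoming or sum(incoming) == node_mult
    ok_out = not outgoing or sum(outgoing) == node_mult
    return ok_in and ok_out


# Toy reads: nodes are 3-mers, arcs are 4-mers.
reads = ["ACGTAC", "CGTACG", "GTACGA"]
node_cov = kmer_counts(reads, 3)   # node_cov["ACG"] == 3
arc_cov = kmer_counts(reads, 4)    # arc_cov["GTAC"] == 3

# With an assumed mean coverage of 10 per genomic copy, coverage values of
# 14 and 15 fall on either side of the decision boundary between 1 and 2.
print(independent_multiplicity(14, lam=10.0))  # 1
print(independent_multiplicity(15, lam=10.0))  # 2

# Neighbouring arcs can settle such a borderline case: if the arcs around
# node ACG carry multiplicities 2 (in) and 1 + 1 (out), only a node
# multiplicity of 2 is consistent with them.
arcs = {"TACG": 2, "ACGT": 1, "ACGA": 1}
print(flow_is_conserved("ACG", 2, arcs))  # True
print(flow_is_conserved("ACG", 1, arcs))  # False
```

The borderline coverage values show why a decision based on a single node or arc can be ambiguous, which is the limitation of individual decisions that the abstract points to.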

Identifiers

DOI: 10.1186/s12859-020-03740-x

PMID: 32928110

PMC: PMC7491180

pii: 10.1186/s12859-020-03740-x

Publication types

Journal Article

Languages

eng

Pagination

402

Authors

Aranka Steyaert

Pieter Audenaert

Jan Fostier
