Scalable long read self-correction and assembly polishing with multiple sequence alignment.


Journal

Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
12 01 2021
History:
received: 26 08 2020
accepted: 22 12 2020
entrez: 13 1 2021
pubmed: 14 1 2021
medline: 26 8 2021
Status: epublish

Abstract

Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, these reads display high error rates, currently capped at around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies on both multiple sequence alignment and local de Bruijn graphs. To ensure scalability, the multiple sequence alignment computation benefits from a new and efficient segmentation strategy that yields a massive speedup. CONSENT compares well to state-of-the-art methods and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that scales efficiently to ultra-long reads, and it can process a full human dataset, containing reads of up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature for correcting raw assemblies. Our experiments show that CONSENT is 2 to 38 times faster than other polishing tools while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and then polishing the assembly consumes fewer resources than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT .

Identifiers

pubmed: 33436980
doi: 10.1038/s41598-020-80757-5
pii: 10.1038/s41598-020-80757-5
pmc: PMC7804095

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

761

Authors

Pierre Morisse (P)

Univ Rennes, Inria, CNRS, IRISA, 35000, Rennes, France. pierre.morisse@inria.fr.

Camille Marchet (C)

Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59000, Lille, France.

Antoine Limasset (A)

Univ. Lille, CNRS, UMR 9189 - CRIStAL, 59000, Lille, France.

Thierry Lecroq (T)

Normandie Univ, UNIROUEN, LITIS, 76000, Rouen, France.

Arnaud Lefebvre (A)

Normandie Univ, UNIROUEN, LITIS, 76000, Rouen, France.
