Benchmarking short and long read polishing tools for nanopore assemblies: achieving near-perfect genomes for outbreak isolates.
Salmonella
Assembly polishing
Bacterial genomics
Benchmarking
Food poisoning outbreaks
Long read sequencing
Nanopore sequencing
Source tracking investigations
Journal
BMC genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258
Publication information
Publication date:
08 Jul 2024
History:
received: 29 Feb 2024
accepted: 01 Jul 2024
medline: 9 Jul 2024
pubmed: 9 Jul 2024
entrez: 8 Jul 2024
Status:
epublish
Abstract
BACKGROUND
Oxford Nanopore provides high-throughput sequencing platforms able to reconstruct complete bacterial genomes with 99.95% accuracy. However, even small levels of error can obscure the phylogenetic relationships between closely related isolates. Polishing tools have been developed to correct these errors, but it is uncertain whether they achieve the accuracy needed for the high-resolution source tracking of foodborne illness outbreaks.
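To make the accuracy figures concrete, the following minimal sketch converts a per-base accuracy into an expected number of erroneous bases. It assumes errors are spread uniformly over a genome of about 4.8 Mbp (the genome size quoted in the results of this abstract). The function name `expected_errors` is hypothetical and is not taken from the study.

```python
def expected_errors(accuracy_percent: float, genome_length_bp: int) -> float:
    """Expected number of erroneous bases for a given per-base accuracy (in percent)."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return error_rate * genome_length_bp


# Assumption: ~4.8 Mbp genome, the size quoted in the abstract's results.
GENOME_BP = 4_800_000

raw_errors = expected_errors(99.95, GENOME_BP)         # unpolished nanopore assembly
polished_errors = expected_errors(99.9999, GENOME_BP)  # near-perfect target

print(f"99.95% accuracy   -> about {raw_errors:.0f} errors")
print(f"99.9999% accuracy -> about {polished_errors:.1f} errors")
```

Under these assumptions, 99.95% accuracy corresponds to roughly 2,400 erroneous bases, whereas 99.9999% corresponds to roughly 5, which is the scale at which relationships between near-identical outbreak isolates are resolved.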
RESULTS
We tested 132 combinations of assembly and short- and long-read polishing tools to assess their accuracy for reconstructing the genome sequences of 15 highly similar Salmonella enterica serovar Newport isolates from a 2020 onion outbreak. While long-read polishing alone improved accuracy, near-perfect accuracy (99.9999% accuracy, or ~5 nucleotide errors across the 4.8 Mbp genome, excluding low-confidence regions) was only obtained by pipelines that combined both long- and short-read polishing tools. Notably, medaka was a more accurate and efficient long-read polisher than Racon. Among short-read polishers, NextPolish showed the highest accuracy, but Pilon, Polypolish, and POLCA performed similarly. Among the 5 best-performing pipelines, polishing with medaka followed by NextPolish was the most common combination. Importantly, the order of polishing tools mattered, i.e., using less accurate tools after more accurate ones introduced errors. Indels in homopolymers and repetitive regions, where the short reads could not be uniquely mapped, remained the most challenging errors to correct.
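Because homopolymer indels are singled out as the hardest residual errors, it can help to flag homopolymer runs in a polished assembly for manual review. The sketch below is illustrative only: the function name `homopolymer_runs` and the default minimum run length of 5 are assumptions, not parameters reported by the study.

```python
import re
from typing import List, Tuple


def homopolymer_runs(seq: str, min_len: int = 5) -> List[Tuple[int, str, int]]:
    """Return (0-based start, base, run length) for each single-base run of at least min_len."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


# Toy sequence (hypothetical): a run of six A's and a run of five C's.
example = "ACGTAAAAAAGGCCCCCT"
for start, base, length in homopolymer_runs(example):
    print(start, base, length)
```

Positions reported this way could be compared against the sites where two polishing pipelines disagree, to check whether the remaining differences fall in homopolymers.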
CONCLUSIONS
Short reads are still needed to correct errors in nanopore-sequenced assemblies to obtain the accuracy required for source-tracking investigations. Our granular assessment of the performance of the polishing pipelines allowed us to suggest best practices for tool users and areas for improvement for tool developers.
Identifiers
pubmed: 38978005
doi: 10.1186/s12864-024-10582-x
pii: 10.1186/s12864-024-10582-x
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
679
Copyright information
© 2024. This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply.