HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.
Journal
Genome Research
ISSN: 1549-5469
Abbreviated title: Genome Res
Country: United States
NLM ID: 9518021
Publication information
Publication date:
09 2020
History:
received: 14 03 2020
accepted: 04 08 2020
pubmed: 18 8 2020
medline: 30 10 2021
entrez: 18 8 2020
Status:
ppublish
Abstract
Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
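Two terms in the abstract can be made concrete with a small sketch. Homopolymer compression is commonly defined as collapsing each run of identical bases to a single base, and a Phred-scaled quality value (QV) is -10·log10 of the per-base error rate, which is how 99.999% accuracy corresponds to QV50. The function names below are hypothetical and chosen for illustration; this is not HiCanu's code, and the abstract does not describe how HiCanu implements either step.

```python
import math
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to one base (illustrative only)."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


def phred_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0 <= accuracy < 1) to a Phred-scaled QV."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0.0:
        raise ValueError("accuracy must be below 1.0 for a finite QV")
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    print(homopolymer_compress("AAACCGTTTT"))  # runs AAA, CC, G, TTTT collapse
    print(round(phred_qv(0.99999)))            # accuracy quoted in the abstract
```

Reads that differ only in homopolymer run length become identical after compression, which is the usual motivation for applying it before overlap detection.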
Identifiers
pubmed: 32801147
pii: gr.263566.120
doi: 10.1101/gr.263566.120
pmc: PMC7545148
Chemical substances
DNA, Neoplasm
0
DNA, Satellite
0
Publication types
Evaluation Study
Journal Article
Research Support, N.I.H., Extramural
Research Support, N.I.H., Intramural
Languages
eng
Citation subsets
IM
Pagination
1291-1305
Grants
Agency: NIGMS NIH HHS
ID: F32 GM134558
Country: United States
Agency: NHGRI NIH HHS
ID: R01 HG010169
Country: United States
Agency: NHGRI NIH HHS
ID: R21 HG010548
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG010971
Country: United States
Agency: NHGRI NIH HHS
ID: R01 HG002385
Country: United States
Copyright information
© 2020 Nurk et al.; Published by Cold Spring Harbor Laboratory Press.