Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies.
Journal
Nature methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604
Publication information
Publication date:
06 2022
History:
received: 2021-07-13
accepted: 2022-03-04
pubmed: 2022-04-02
medline: 2022-06-14
entrez: 2022-04-01
Status:
ppublish
Abstract
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
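The quality values quoted in the abstract (70.2 before polishing, 73.9 after) can be read as per-base error rates. The sketch below assumes the standard Phred definition, QV = -10 * log10(error rate); the abstract does not spell out the formula, and the function names are illustrative rather than taken from the paper or its tools.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability.

    Assumes the standard Phred definition: QV = -10 * log10(error_rate).
    """
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability back to a Phred-scaled QV."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # QVs reported in the abstract, before and after polishing.
    for qv in (70.2, 73.9):
        rate = qv_to_error_rate(qv)
        print(f"QV {qv}: about {rate:.2e} errors per base "
              f"(roughly one error per {1 / rate:,.0f} bases)")
```

Because the scale is logarithmic, every 10-point increase in QV corresponds to a tenfold lower error rate, so a gain of a few QV points at this level already reflects a substantial reduction in errors per base.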
Identifiers
pubmed: 35361931
doi: 10.1038/s41592-022-01440-3
pii: 10.1038/s41592-022-01440-3
pmc: PMC9812399
mid: NIHMS1850370
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, N.I.H., Intramural
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
687-695
Grants
Agency: NHGRI NIH HHS
ID: R01 HG006677
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG010961
Country: United States
Agency: Wellcome Trust
ID: WT206194
Country: United Kingdom
Agency: NHGRI NIH HHS
ID: U41 HG010972
Country: United States
Agency: NIGMS NIH HHS
ID: F32 GM134558
Country: United States
Agency: NIH HHS
ID: OT2 OD026682
Country: United States
Agency: Intramural NIH HHS
ID: Z99 HG999999
Country: United States
Agency: NHGRI NIH HHS
ID: R01 HG011274
Country: United States
Agency: NHGRI NIH HHS
ID: R21 HG010548
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG010971
Country: United States
Agency: NHGRI NIH HHS
ID: U24 HG011853
Country: United States
Agency: NHGRI NIH HHS
ID: R01 HG010485
Country: United States
Agency: Wellcome Trust
Country: United Kingdom
Agency: Intramural NIH HHS
ID: ZIA HG200398
Country: United States
Comments and corrections
Type: CommentIn
Copyright information
© 2022. This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.