Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation.


Journal

Nature methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604

Publication information

Publication date:
06 2022
History:
received: 13 07 2021
accepted: 07 03 2022
pubmed: 2 4 2022
medline: 14 6 2022
entrez: 1 4 2022
Status: ppublish

Abstract

Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.
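The idea of judging a variant by the k-mers it creates, rather than by alignment or caller scores, can be illustrated with a toy sketch. This is not Merfin's implementation: it ignores expected multiplicity and only asks whether applying a variant increases the number of k-mers at the site that never occur in the reads. All function names (`count_kmers`, `spanning_kmers`, `keep_variant`), the sequences, and the choice of k = 4 are hypothetical and chosen for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (toy stand-in for a read k-mer database)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def spanning_kmers(seq, start, length, k):
    """Return the k-mers of seq that overlap the interval [start, start + length)."""
    first = max(0, start - k + 1)
    last = min(len(seq) - k, start + length - 1)
    return [seq[i:i + k] for i in range(first, last + 1)]


def missing(kmers, read_counts):
    """Number of k-mers that are never observed in the reads."""
    return sum(1 for kmer in kmers if read_counts[kmer] == 0)


def keep_variant(assembly, pos, ref, alt, read_counts, k):
    """Keep a variant only if it does not add read-unsupported k-mers.

    Assumes VCF-style alleles (both ref and alt non-empty) and a 0-based pos.
    This presence/absence test is a simplification made for this sketch.
    """
    if assembly[pos:pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the assembly at pos")
    edited = assembly[:pos] + alt + assembly[pos + len(ref):]
    before = missing(spanning_kmers(assembly, pos, len(ref), k), read_counts)
    after = missing(spanning_kmers(edited, pos, len(alt), k), read_counts)
    return after <= before


# Toy data: the reads come from `truth`; the assembly has one base error.
truth = "ACGTACGGTCATTGCA"
assembly = "ACGTACGATCATTGCA"  # position 7 is A instead of G
read_counts = count_kmers([truth], k=4)

fixes_error = keep_variant(assembly, 7, "A", "G", read_counts, k=4)
adds_error = keep_variant(assembly, 2, "G", "T", read_counts, k=4)
print(fixes_error, adds_error)
```

In this toy case the first variant restores k-mers that are present in the reads and is kept, while the second creates k-mers absent from the reads and is rejected. A multiplicity-aware version would compare observed k-mer counts with the copy number expected from read coverage instead of testing presence alone.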

Identifiers

pubmed: 35361932
doi: 10.1038/s41592-022-01445-y
pii: 10.1038/s41592-022-01445-y
pmc: PMC9745813
mid: NIHMS1849845

Publication types

Journal Article
Research Support, N.I.H., Intramural
Research Support, Non-U.S. Gov't
Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

696-704

Grants

Agency: NHGRI NIH HHS
ID: U01 HG010961
Country: United States
Agency: NHGRI NIH HHS
ID: U41 HG010972
Country: United States
Agency: Howard Hughes Medical Institute
Country: United States
Agency: NIH HHS
ID: OT2 OD026682
Country: United States
Agency: NHGRI NIH HHS
ID: U01 HG010971
Country: United States
Agency: NHGRI NIH HHS
ID: U24 HG011853
Country: United States
Agency: NHGRI NIH HHS
ID: R01 HG010485
Country: United States
Agency: Intramural NIH HHS
ID: ZIA HG200398
Country: United States

Copyright information

© 2022. This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.

Authors

Giulio Formenti (G)

Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA. gformenti@mail.rockefeller.edu.
Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA. gformenti@mail.rockefeller.edu.
Howard Hughes Medical Institute, Chevy Chase, MD, USA. gformenti@mail.rockefeller.edu.

Arang Rhie (A)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. arang.rhie@nih.gov.

Brian P Walenz (BP)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.

Françoise Thibaud-Nissen (F)

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Kishwar Shafin (K)

UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA.

Sergey Koren (S)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.

Eugene W Myers (EW)

Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany.

Erich D Jarvis (ED)

Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA.
Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA.
Howard Hughes Medical Institute, Chevy Chase, MD, USA.

Adam M Phillippy (AM)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
