Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C.


Journal

Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
28 04 2021
History:
received: 13 05 2020
accepted: 12 11 2020
entrez: 29 4 2021
pubmed: 30 4 2021
medline: 13 5 2021
Status: epublish

Abstract

Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without parental data, and performance is better in samples with higher heterozygosity. For cow and zebra finch, the accuracy is 97%, compared to 80-91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.

Identifiers

pubmed: 33911078
doi: 10.1038/s41467-020-20536-y
pii: 10.1038/s41467-020-20536-y
pmc: PMC8081726

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

1935

Grants

Agency: Howard Hughes Medical Institute
Country: United States

Authors

Zev N Kronenberg (ZN)

Phase Genomics, Seattle, WA, USA. zkronenberg@pacificbiosciences.com.
Pacific Biosciences, Menlo Park, CA, USA. zkronenberg@pacificbiosciences.com.

Arang Rhie (A)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA.

Sergey Koren (S)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA.

Gregory T Concepcion (GT)

Pacific Biosciences, Menlo Park, CA, USA.

Paul Peluso (P)

Pacific Biosciences, Menlo Park, CA, USA.

Katherine M Munson (KM)

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.

David Porubsky (D)

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.

Kristen Kuhn (K)

US Meat Animal Research Center, ARS USDA, Clay Center, NE, USA.

Kathryn A Mueller (KA)

Phase Genomics, Seattle, WA, USA.

Wai Yee Low (WY)

Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy, SA, Australia.

Stefan Hiendleder (S)

Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy, SA, Australia.

Olivier Fedrigo (O)

Vertebrate Genomes Laboratory, The Rockefeller University, New York, NY, USA.

Ivan Liachko (I)

Phase Genomics, Seattle, WA, USA.

Richard J Hall (RJ)

Pacific Biosciences, Menlo Park, CA, USA.

Adam M Phillippy (AM)

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA.

Evan E Eichler (EE)

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.

John L Williams (JL)

Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy, SA, Australia.
Dipartimento di Scienze Animali, della Nutrizione e degli Alimenti, Università Cattolica del Sacro Cuore, 29122, Piacenza, Italy.

Timothy P L Smith (TPL)

US Meat Animal Research Center, ARS USDA, Clay Center, NE, USA.

Erich D Jarvis (ED)

Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA.
Howard Hughes Medical Institute, Chevy Chase, MD, USA.

Shawn T Sullivan (ST)

Phase Genomics, Seattle, WA, USA.

Sarah B Kingan (SB)

Pacific Biosciences, Menlo Park, CA, USA. skingan@pacificbiosciences.com.
