ComHapDet: a spatial community detection algorithm for haplotype assembly.


Journal

BMC genomics
ISSN: 1471-2164
Titre abrégé: BMC Genomics
Pays: England
ID NLM: 100965258

Informations de publication

Date de publication:
09 Sep 2020
Historique:
entrez: 9 9 2020
pubmed: 10 9 2020
medline: 15 5 2021
Statut: epublish

Résumé

Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet - a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants. Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.

Sections du résumé

BACKGROUND BACKGROUND
Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data.
RESULTS RESULTS
We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet - a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants.
CONCLUSIONS CONCLUSIONS
Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.

Identifiants

pubmed: 32900369
doi: 10.1186/s12864-020-06935-x
pii: 10.1186/s12864-020-06935-x
pmc: PMC7488034
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

586

Références

BMC Genomics. 2018 Mar 21;19(Suppl 4):191
pubmed: 29589554
IEEE/ACM Trans Comput Biol Bioinform. 2016 May-Jun;13(3):518-30
pubmed: 27295635
J Comput Biol. 2016 Sep;23(9):718-36
pubmed: 27280382
J Comput Biol. 2012 Jun;19(6):577-90
pubmed: 22697235
Bioinformatics. 2008 Aug 15;24(16):i153-9
pubmed: 18689818
Bioinformatics. 2018 Nov 15;34(22):3864-3872
pubmed: 29868858
PLoS Comput Biol. 2014 Mar 27;10(3):e1003502
pubmed: 24675685
Genome Res. 2007 Jul;17(7):1101-10
pubmed: 17567986
Bioinformatics. 2010 Sep 15;26(18):2217-25
pubmed: 20624781
Genet Epidemiol. 2004 Dec;27(4):321-33
pubmed: 15368617
Res Comput Mol Biol. 2017 May;10229:117-133
pubmed: 28808695
PLoS Biol. 2007 Sep 4;5(10):e254
pubmed: 17803354
Genome Res. 2008 Aug;18(8):1336-46
pubmed: 18676820
Nature. 2002 Oct 24;419(6909):832-7
pubmed: 12397357
Bioinformatics. 2016 Dec 15;32(24):3735-3744
pubmed: 27531103
Bioinformatics. 2016 Jun 1;32(11):1610-7
pubmed: 26315913
BMC Genomics. 2015 Apr 03;16:260
pubmed: 25885901
Brief Bioinform. 2002 Mar;3(1):23-31
pubmed: 12002221
Bioinformatics. 2010 Jun 15;26(12):i183-90
pubmed: 20529904
Brief Bioinform. 2018 May 1;19(3):387-403
pubmed: 28065918
Nature. 2011 Jul 10;475(7355):189-95
pubmed: 21743474
Bioinformatics. 2014 Sep 1;30(17):i379-85
pubmed: 25161223
Bioinformatics. 2013 Aug 15;29(16):1938-45
pubmed: 23782612

Auteurs

Abishek Sankararaman (A)

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA. abishek@utexas.edu.

Haris Vikalo (H)

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA.

François Baccelli (F)

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA.
Department of Mathematics, The University of Texas at Austin, Austin, TX, USA.

Articles similaires

Genome, Chloroplast Phylogeny Genetic Markers Base Composition High-Throughput Nucleotide Sequencing

[Redispensing of expensive oral anticancer medicines: a practical application].

Lisanne N van Merendonk, Kübra Akgöl, Bastiaan Nuijen
1.00
Humans Antineoplastic Agents Administration, Oral Drug Costs Counterfeit Drugs

Smoking Cessation and Incident Cardiovascular Disease.

Jun Hwan Cho, Seung Yong Shin, Hoseob Kim et al.
1.00
Humans Male Smoking Cessation Cardiovascular Diseases Female
Humans United States Aged Cross-Sectional Studies Medicare Part C

Classifications MeSH