ComHapDet: a spatial community detection algorithm for haplotype assembly.
Graph clustering
Haplotype assembly
Spatial random graph
Journal
BMC genomics
ISSN: 1471-2164
Titre abrégé: BMC Genomics
Pays: England
ID NLM: 100965258
Informations de publication
Date de publication:
09 Sep 2020
09 Sep 2020
Historique:
entrez:
9
9
2020
pubmed:
10
9
2020
medline:
15
5
2021
Statut:
epublish
Résumé
Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data. We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet - a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants. Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
Sections du résumé
BACKGROUND
BACKGROUND
Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data.
RESULTS
RESULTS
We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet - a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants.
CONCLUSIONS
CONCLUSIONS
Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.
Identifiants
pubmed: 32900369
doi: 10.1186/s12864-020-06935-x
pii: 10.1186/s12864-020-06935-x
pmc: PMC7488034
doi:
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
586Références
BMC Genomics. 2018 Mar 21;19(Suppl 4):191
pubmed: 29589554
IEEE/ACM Trans Comput Biol Bioinform. 2016 May-Jun;13(3):518-30
pubmed: 27295635
J Comput Biol. 2016 Sep;23(9):718-36
pubmed: 27280382
J Comput Biol. 2012 Jun;19(6):577-90
pubmed: 22697235
Bioinformatics. 2008 Aug 15;24(16):i153-9
pubmed: 18689818
Bioinformatics. 2018 Nov 15;34(22):3864-3872
pubmed: 29868858
PLoS Comput Biol. 2014 Mar 27;10(3):e1003502
pubmed: 24675685
Genome Res. 2007 Jul;17(7):1101-10
pubmed: 17567986
Bioinformatics. 2010 Sep 15;26(18):2217-25
pubmed: 20624781
Genet Epidemiol. 2004 Dec;27(4):321-33
pubmed: 15368617
Res Comput Mol Biol. 2017 May;10229:117-133
pubmed: 28808695
PLoS Biol. 2007 Sep 4;5(10):e254
pubmed: 17803354
Genome Res. 2008 Aug;18(8):1336-46
pubmed: 18676820
Nature. 2002 Oct 24;419(6909):832-7
pubmed: 12397357
Bioinformatics. 2016 Dec 15;32(24):3735-3744
pubmed: 27531103
Bioinformatics. 2016 Jun 1;32(11):1610-7
pubmed: 26315913
BMC Genomics. 2015 Apr 03;16:260
pubmed: 25885901
Brief Bioinform. 2002 Mar;3(1):23-31
pubmed: 12002221
Bioinformatics. 2010 Jun 15;26(12):i183-90
pubmed: 20529904
Brief Bioinform. 2018 May 1;19(3):387-403
pubmed: 28065918
Nature. 2011 Jul 10;475(7355):189-95
pubmed: 21743474
Bioinformatics. 2014 Sep 1;30(17):i379-85
pubmed: 25161223
Bioinformatics. 2013 Aug 15;29(16):1938-45
pubmed: 23782612