Heuristic algorithms for best match graph editing.
Arc modification problems
Consistent algorithm
Heuristic algorithm
NP-hardness
Journal
Algorithms for molecular biology : AMB
ISSN: 1748-7188
Titre abrégé: Algorithms Mol Biol
Pays: England
ID NLM: 101265088
Informations de publication
Date de publication:
17 Aug 2021
17 Aug 2021
Historique:
received:
17
03
2021
accepted:
26
06
2021
entrez:
18
8
2021
pubmed:
19
8
2021
medline:
19
8
2021
Statut:
epublish
Résumé
Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics as a representation of the pairwise most closely related genes among multiple species. An arc connects a gene x with a gene y from another species (vertex color) Y whenever it is one of the phylogenetically closest relatives of x. BMGs can be approximated with the help of similarity measures between gene sequences, albeit not without errors. Empirical estimates thus will usually violate the theoretical properties of BMGs. The corresponding graph editing problem can be used to guide error correction for best match data. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data. Since BMGs have a characterization in terms of consistency of a certain set of rooted triples (binary trees on three vertices) defined on the set of genes, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem that leads to a class of top-down recursive algorithms that are similar to Aho's supertree algorithm and give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing. Noisy BMG data can be corrected with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods.
Sections du résumé
BACKGROUND
BACKGROUND
Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics as a representation of the pairwise most closely related genes among multiple species. An arc connects a gene x with a gene y from another species (vertex color) Y whenever it is one of the phylogenetically closest relatives of x. BMGs can be approximated with the help of similarity measures between gene sequences, albeit not without errors. Empirical estimates thus will usually violate the theoretical properties of BMGs. The corresponding graph editing problem can be used to guide error correction for best match data. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data.
RESULTS
RESULTS
Since BMGs have a characterization in terms of consistency of a certain set of rooted triples (binary trees on three vertices) defined on the set of genes, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem that leads to a class of top-down recursive algorithms that are similar to Aho's supertree algorithm and give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing.
CONCLUSION
CONCLUSIONS
Noisy BMG data can be corrected with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods.
Identifiants
pubmed: 34404422
doi: 10.1186/s13015-021-00196-3
pii: 10.1186/s13015-021-00196-3
pmc: PMC8369769
doi:
Types de publication
Journal Article
Langues
eng
Pagination
19Subventions
Organisme : Deutsche Forschungsgemeinschaft<
ID : STA 850/49-1
Organisme : competence centers for excellent technologies
ID : 865891
Informations de copyright
© 2021. The Author(s).
Références
J Bioinform Comput Biol. 2006 Feb;4(1):59-74
pubmed: 16568542
BMC Bioinformatics. 2011 Apr 28;12:124
pubmed: 21526987
BMC Genomics. 2014 Jun 25;15:522
pubmed: 24965762
Algorithms Mol Biol. 2020 Apr 9;15:5
pubmed: 32308731
Phys Rev E Stat Nonlin Soft Matter Phys. 2015 Sep;92(3):032801
pubmed: 26465522
Genes (Basel). 2018 Feb 28;9(3):
pubmed: 29495636
Trends Ecol Evol. 1994 Aug;9(8):297-8
pubmed: 21236859
Bioinformatics. 2018 Jul 1;34(13):i366-i375
pubmed: 29950018
Nat Methods. 2015 Jan;12(1):59-60
pubmed: 25402007
Brief Bioinform. 2016 Nov;17(6):1009-1023
pubmed: 26615024
J Math Biol. 2021 Feb 19;82(3):20
pubmed: 33606106
PLoS One. 2014 Aug 19;9(8):e105015
pubmed: 25137074
PLoS One. 2010 Oct 15;5(10):e13409
pubmed: 20976221
Protein Eng. 1999 Feb;12(2):85-94
pubmed: 10195279
J Theor Biol. 1987 Sep 7;128(1):11-45
pubmed: 3431131
J Math Biol. 2019 Jun;78(7):2015-2057
pubmed: 30968198
Evolution. 2002 Jul;56(7):1317-30
pubmed: 12206234
J Math Biol. 2021 Apr 5;82(6):47
pubmed: 33818665
J Theor Biol. 2016 May 21;397:89-102
pubmed: 26953649
Bioinformatics. 2008 Feb 1;24(3):319-24
pubmed: 18042555
BMC Genomics. 2020 Oct 24;21(1):741
pubmed: 33099302