Heuristic algorithms for best match graph editing.

Arc modification problems Consistent algorithm Heuristic algorithm NP-hardness

Journal

Algorithms for molecular biology : AMB
ISSN: 1748-7188
Titre abrégé: Algorithms Mol Biol
Pays: England
ID NLM: 101265088

Informations de publication

Date de publication:
17 Aug 2021
Historique:
received: 17 03 2021
accepted: 26 06 2021
entrez: 18 8 2021
pubmed: 19 8 2021
medline: 19 8 2021
Statut: epublish

Résumé

Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics as a representation of the pairwise most closely related genes among multiple species. An arc connects a gene x with a gene y from another species (vertex color) Y whenever it is one of the phylogenetically closest relatives of x. BMGs can be approximated with the help of similarity measures between gene sequences, albeit not without errors. Empirical estimates thus will usually violate the theoretical properties of BMGs. The corresponding graph editing problem can be used to guide error correction for best match data. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data. Since BMGs have a characterization in terms of consistency of a certain set of rooted triples (binary trees on three vertices) defined on the set of genes, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem that leads to a class of top-down recursive algorithms that are similar to Aho's supertree algorithm and give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing. Noisy BMG data can be corrected with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods.

Sections du résumé

BACKGROUND BACKGROUND
Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics as a representation of the pairwise most closely related genes among multiple species. An arc connects a gene x with a gene y from another species (vertex color) Y whenever it is one of the phylogenetically closest relatives of x. BMGs can be approximated with the help of similarity measures between gene sequences, albeit not without errors. Empirical estimates thus will usually violate the theoretical properties of BMGs. The corresponding graph editing problem can be used to guide error correction for best match data. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data.
RESULTS RESULTS
Since BMGs have a characterization in terms of consistency of a certain set of rooted triples (binary trees on three vertices) defined on the set of genes, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem that leads to a class of top-down recursive algorithms that are similar to Aho's supertree algorithm and give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing.
CONCLUSION CONCLUSIONS
Noisy BMG data can be corrected with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods.

Identifiants

pubmed: 34404422
doi: 10.1186/s13015-021-00196-3
pii: 10.1186/s13015-021-00196-3
pmc: PMC8369769
doi:

Types de publication

Journal Article

Langues

eng

Pagination

19

Subventions

Organisme : Deutsche Forschungsgemeinschaft<
ID : STA 850/49-1
Organisme : competence centers for excellent technologies
ID : 865891

Informations de copyright

© 2021. The Author(s).

Références

J Bioinform Comput Biol. 2006 Feb;4(1):59-74
pubmed: 16568542
BMC Bioinformatics. 2011 Apr 28;12:124
pubmed: 21526987
BMC Genomics. 2014 Jun 25;15:522
pubmed: 24965762
Algorithms Mol Biol. 2020 Apr 9;15:5
pubmed: 32308731
Phys Rev E Stat Nonlin Soft Matter Phys. 2015 Sep;92(3):032801
pubmed: 26465522
Genes (Basel). 2018 Feb 28;9(3):
pubmed: 29495636
Trends Ecol Evol. 1994 Aug;9(8):297-8
pubmed: 21236859
Bioinformatics. 2018 Jul 1;34(13):i366-i375
pubmed: 29950018
Nat Methods. 2015 Jan;12(1):59-60
pubmed: 25402007
Brief Bioinform. 2016 Nov;17(6):1009-1023
pubmed: 26615024
J Math Biol. 2021 Feb 19;82(3):20
pubmed: 33606106
PLoS One. 2014 Aug 19;9(8):e105015
pubmed: 25137074
PLoS One. 2010 Oct 15;5(10):e13409
pubmed: 20976221
Protein Eng. 1999 Feb;12(2):85-94
pubmed: 10195279
J Theor Biol. 1987 Sep 7;128(1):11-45
pubmed: 3431131
J Math Biol. 2019 Jun;78(7):2015-2057
pubmed: 30968198
Evolution. 2002 Jul;56(7):1317-30
pubmed: 12206234
J Math Biol. 2021 Apr 5;82(6):47
pubmed: 33818665
J Theor Biol. 2016 May 21;397:89-102
pubmed: 26953649
Bioinformatics. 2008 Feb 1;24(3):319-24
pubmed: 18042555
BMC Genomics. 2020 Oct 24;21(1):741
pubmed: 33099302

Auteurs

David Schaller (D)

Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04109 Leipzig, Leipzig, Germany. sdavid@bioinf.uni-leipzig.de.
Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, D-04107, Leipzig, Germany. sdavid@bioinf.uni-leipzig.de.

Manuela Geiß (M)

Software Competence Center Hagenberg GmbH, Softwarepark 21, A-4232, Hagenberg, Austria.

Marc Hellmuth (M)

Department of Mathematics, Faculty of Science, Stockholm University, SE-10691, Stockholm, Sweden.

Peter F Stadler (PF)

Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04109 Leipzig, Leipzig, Germany.
Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, D-04107, Leipzig, Germany.
Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Interdisciplinary Center for Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv), and Leipzig Research Center for Civilization Diseases, Universität Leipzig, Augustusplatz 12, D-04107, Leipzig, Germany.
Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, A-1090, Vienna, Austria.
Facultad de Ciencias, Universidad National de Colombia, Sede Bogotá, Ciudad Universitaria, COL-111321, Bogotá, D.C., Colombia.
Santa Fe Institute, 1399 Hyde Park Rd., NM87501, Santa Fe, USA.

Classifications MeSH