Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty.

Keywords: difficulty prediction; heuristics; maximum likelihood; phylogenetics

Journal

Molecular biology and evolution
ISSN: 1537-1719
Abbreviated title: Mol Biol Evol
Country: United States
NLM ID: 8501455

Publication information

Publication date:
04 10 2023
History:
received: 23 05 2023
revised: 06 09 2023
accepted: 26 09 2023
medline: 23 10 2023
pubmed: 7 10 2023
entrez: 7 10 2023
Status: ppublish

Abstract

Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, and these stem from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty: while the likelihood surface exhibits multiple locally optimal peaks in this case, only a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty show substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the trees inferred using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).
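
The three observations above amount to a mapping from predicted difficulty to search thoroughness. The following is a minimal sketch of that mapping only, not the RAxML-NG implementation. It assumes the predicted difficulty is a score in [0, 1]; the cutoffs EASY_MAX and DIFFICULT_MIN, the SearchPlan type, and the function plan_search are hypothetical names and values chosen for illustration, since the abstract does not state them.

```python
from dataclasses import dataclass

# Hypothetical cutoffs on a difficulty score assumed to lie in [0, 1].
# The abstract does not give the values RAxML-NG actually uses.
EASY_MAX = 0.3
DIFFICULT_MIN = 0.7


@dataclass(frozen=True)
class SearchPlan:
    category: str    # "easy", "intermediate", or "difficult"
    thorough: bool   # True if a more extensive search is warranted
    rationale: str   # which of the three observations applies


def plan_search(difficulty: float) -> SearchPlan:
    """Map a predicted difficulty score to a coarse search plan."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    if difficulty < EASY_MAX:
        # Observation 1: searches converge rapidly, so stop early.
        return SearchPlan("easy", False,
                          "searches converge rapidly; terminate early")
    if difficulty > DIFFICULT_MIN:
        # Observation 2: many almost equally likely topologies exist,
        # so quickly inferring one of them suffices.
        return SearchPlan("difficult", False,
                          "quickly infer one of many near-equal topologies")
    # Observation 3: a few peaks are significantly better, so a more
    # extensive search is justified.
    return SearchPlan("intermediate", True,
                      "few peaks are significantly better; search extensively")


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, plan_search(score))
```

Under these assumed cutoffs, only the intermediate band triggers the thorough search, which matches the abstract's report that the largest speedups occur on easy and difficult datasets.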

Identifiers

pubmed: 37804116
pii: 7296053
doi: 10.1093/molbev/msad227
pmc: PMC10584362

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.

Conflict of interest statement

None declared.

References

Proc Natl Acad Sci U S A. 2001 Sep 11;98(19):10751-6
pubmed: 11526218
Infect Genet Evol. 2020 Sep;83:104351
pubmed: 32387564
J Mol Evol. 1981;17(6):368-76
pubmed: 7288891
Mol Biol Evol. 2022 Dec 5;39(12):
pubmed: 36395091
Mol Biol Evol. 1987 Jul;4(4):406-25
pubmed: 3447015
Syst Biol. 2007 Dec;56(6):988-1010
pubmed: 18066931
Mol Biol Evol. 2020 May 1;37(5):1530-1534
pubmed: 32011700
Mol Biol Evol. 2015 Jan;32(1):268-74
pubmed: 25371430
PLoS One. 2010 Mar 10;5(3):e9490
pubmed: 20224823
Mol Biol Evol. 1997 Jul;14(7):717-24
pubmed: 9214744
PLoS One. 2011;6(11):e27731
pubmed: 22132132
Syst Biol. 2010 May;59(3):307-21
pubmed: 20525638
Syst Biol. 2017 Jan 1;66(1):e83-e94
pubmed: 28173538
Bioinformatics. 2014 May 1;30(9):1312-3
pubmed: 24451623
Bioinform Adv. 2023 Sep 14;3(1):vbad124
pubmed: 37750068
Biometrics. 1999 Mar;55(1):1-12
pubmed: 11318142
IEEE/ACM Trans Comput Biol Bioinform. 2006 Jan-Mar;3(1):92-4
pubmed: 17048396
Bioinformatics. 2019 Nov 1;35(21):4453-4455
pubmed: 31070718
Mol Biol Evol. 2021 May 4;38(5):1777-1791
pubmed: 33316067
Mol Biol Evol. 2002 Jul;19(7):1171-80
pubmed: 12082136
Bioinformatics. 2022 Mar 4;38(6):1741-1742
pubmed: 34962976
Genome Biol Evol. 2019 Dec 1;11(12):3341-3352
pubmed: 31536115
Mol Biol Evol. 2018 Feb 1;35(2):486-503
pubmed: 29177474

Authors

Anastasis Togkousidis (A)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany.

Oleksiy M Kozlov (OM)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany.

Julia Haag (J)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany.

Dimitri Höhler (D)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany.

Alexandros Stamatakis (A)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany.
Institute of Theoretical Informatics, Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany.
Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology - Hellas, GR - 711 10 Heraklion, Crete, Greece.
