Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty.
difficulty prediction
heuristics
maximum likelihood
phylogenetics
Journal
Molecular biology and evolution
ISSN: 1537-1719
Abbreviated title: Mol Biol Evol
Country: United States
NLM ID: 8501455
Publication information
Publication date: 04 10 2023
History:
received: 23 05 2023
revised: 06 09 2023
accepted: 26 09 2023
medline: 23 10 2023
pubmed: 7 10 2023
entrez: 7 10 2023
Status: ppublish
Abstract
Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).
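The three observations in the abstract can be illustrated with a minimal sketch that maps a predicted difficulty score to a search plan. This is not the RAxML-NG implementation: the score range of 0 to 1, the cut-offs `EASY_MAX` and `HARD_MIN`, the `SearchPlan` fields, and the starting-tree counts are all hypothetical values chosen only to show the shape of the idea (less effort at both extremes, more effort in between).

```python
from dataclasses import dataclass

# Hypothetical cut-offs on a difficulty score assumed to lie in [0, 1].
# The abstract does not state the thresholds used by the real heuristic.
EASY_MAX = 0.3
HARD_MIN = 0.7


@dataclass(frozen=True)
class SearchPlan:
    category: str            # "easy", "intermediate", or "difficult"
    num_starting_trees: int  # illustrative effort knob, not a real setting
    early_stop: bool         # whether the search may terminate early


def plan_search(difficulty: float) -> SearchPlan:
    """Pick an illustrative search plan from a predicted difficulty score."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty is assumed to lie in [0, 1]")
    if difficulty < EASY_MAX:
        # Observation 1: easy datasets converge rapidly, so stop early.
        return SearchPlan("easy", num_starting_trees=1, early_stop=True)
    if difficulty > HARD_MIN:
        # Observation 2: on difficult datasets, quickly infer just one of
        # the many almost equally likely topologies.
        return SearchPlan("difficult", num_starting_trees=1, early_stop=True)
    # Observation 3: intermediate datasets justify a more extensive search.
    return SearchPlan("intermediate", num_starting_trees=10, early_stop=False)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, plan_search(score))
```

The point of the sketch is that search effort is not monotonic in difficulty: the most thorough plan is reserved for the middle band, where a small proportion of the likelihood peaks is significantly better than the rest.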
Identifiers
pubmed: 37804116
pii: 7296053
doi: 10.1093/molbev/msad227
pmc: PMC10584362
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
Conflict of interest statement
None declared.
References
Proc Natl Acad Sci U S A. 2001 Sep 11;98(19):10751-6
pubmed: 11526218
Infect Genet Evol. 2020 Sep;83:104351
pubmed: 32387564
J Mol Evol. 1981;17(6):368-76
pubmed: 7288891
Mol Biol Evol. 2022 Dec 5;39(12):
pubmed: 36395091
Mol Biol Evol. 1987 Jul;4(4):406-25
pubmed: 3447015
Syst Biol. 2007 Dec;56(6):988-1010
pubmed: 18066931
Mol Biol Evol. 2020 May 1;37(5):1530-1534
pubmed: 32011700
Mol Biol Evol. 2015 Jan;32(1):268-74
pubmed: 25371430
PLoS One. 2010 Mar 10;5(3):e9490
pubmed: 20224823
Mol Biol Evol. 1997 Jul;14(7):717-24
pubmed: 9214744
PLoS One. 2011;6(11):e27731
pubmed: 22132132
Syst Biol. 2010 May;59(3):307-21
pubmed: 20525638
Syst Biol. 2017 Jan 1;66(1):e83-e94
pubmed: 28173538
Bioinformatics. 2014 May 1;30(9):1312-3
pubmed: 24451623
Bioinform Adv. 2023 Sep 14;3(1):vbad124
pubmed: 37750068
Biometrics. 1999 Mar;55(1):1-12
pubmed: 11318142
IEEE/ACM Trans Comput Biol Bioinform. 2006 Jan-Mar;3(1):92-4
pubmed: 17048396
Bioinformatics. 2019 Nov 1;35(21):4453-4455
pubmed: 31070718
Mol Biol Evol. 2021 May 4;38(5):1777-1791
pubmed: 33316067
Mol Biol Evol. 2002 Jul;19(7):1171-80
pubmed: 12082136
Bioinformatics. 2022 Mar 4;38(6):1741-1742
pubmed: 34962976
Genome Biol Evol. 2019 Dec 1;11(12):3341-3352
pubmed: 31536115
Mol Biol Evol. 2018 Feb 1;35(2):486-503
pubmed: 29177474