From Easy to Hopeless-Predicting the Difficulty of Phylogenetic Analyses.
machine learning
maximum likelihood
phylogenetics
random forest regression
Journal
Molecular biology and evolution
ISSN: 1537-1719
Abbreviated title: Mol Biol Evol
Country: United States
NLM ID: 8501455
Publication information
Publication date:
05 12 2022
History:
pubmed: 18 11 2022
medline: 15 12 2022
entrez: 17 11 2022
Status:
ppublish
Abstract
Phylogenetic analyses under the Maximum-Likelihood (ML) model are time- and resource-intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies; on others, they yield multiple topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating ML-based tree inferences. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.
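The convergence behavior described in the abstract can be illustrated with a small sketch. This is not the difficulty measure defined by the authors and not Pythia's implementation; it is a simplified proxy under stated assumptions: each inferred tree is given as a set of non-trivial splits, taxa have single-character names, and the helper names `tree`, `rf_distance`, and `topology_spread` are hypothetical.

```python
from itertools import combinations


def tree(*splits):
    """Build a tree as a frozenset of splits; "AB" means taxa {A, B} vs. the rest.

    Assumption: single-character taxon names, and each split is always
    written from the same side so that equal splits compare equal.
    """
    return frozenset(frozenset(s) for s in splits)


def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance: number of splits found in exactly one tree."""
    return len(splits_a ^ splits_b)


def topology_spread(trees):
    """Return (fraction of unique topologies, mean normalized pairwise RF)."""
    unique_fraction = len(set(trees)) / len(trees)
    pairs = list(combinations(trees, 2))
    if not pairs:
        return unique_fraction, 0.0
    # Normalize each distance by the total number of splits in both trees.
    normalized = [rf_distance(a, b) / (len(a) + len(b)) for a, b in pairs]
    return unique_fraction, sum(normalized) / len(normalized)


# Four inferences that all agree: an "easy" toy dataset.
easy = [tree("AB", "DE")] * 4

# Four inferences that all disagree: a "hard" toy dataset.
hard = [tree("AB", "DE"), tree("AC", "DE"), tree("AB", "CD"), tree("AD", "BE")]

easy_spread = topology_spread(easy)
hard_spread = topology_spread(hard)
```

In this toy setting, values near zero for the mean distance suggest that independent inferences agree, while values near one suggest they scatter across tree space. A real analysis would also need a statistical test to decide which of the distinct topologies are indistinguishable, which this sketch omits.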
Identifiers
pubmed: 36395091
pii: 6832260
doi: 10.1093/molbev/msac254
pmc: PMC9728795
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2022. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.