From Easy to Hopeless - Predicting the Difficulty of Phylogenetic Analyses.

machine learning maximum likelihood phylogenetics random forest regression

Journal

Molecular biology and evolution
ISSN: 1537-1719
Abbreviated title: Mol Biol Evol
Country: United States
NLM ID: 8501455

Publication information

Publication date:
2022-12-05
History:
pubmed: 2022-11-18
medline: 2022-12-15
entrez: 2022-11-17
Status: ppublish

Abstract

Phylogenetic analyses under the Maximum-Likelihood (ML) model are time- and resource-intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies; on others, they converge to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating ML-based tree inferences. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.
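
The abstract does not spell out how the spread among independently inferred trees is measured. As a rough illustration only, the sketch below assumes that each inferred tree is summarized by its set of non-trivial bipartitions (splits) and that disagreement is measured with the relative Robinson-Foulds distance; this is an assumption made for the example, not necessarily the measure used by Pythia. The names `canonical_split`, `relative_rf`, and `topology_spread` are hypothetical.

```python
from itertools import combinations

TAXA = frozenset("ABCDE")  # toy taxon set, assumed for this example


def canonical_split(side, taxa=TAXA):
    """Return the side of a bipartition that does not contain the reference taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return taxa - side if reference in side else side


def make_tree(*sides):
    """Represent an unrooted tree by its set of canonical non-trivial splits."""
    return frozenset(canonical_split(s) for s in sides)


def relative_rf(tree_a, tree_b):
    """Relative Robinson-Foulds distance: unshared splits over all splits."""
    total = len(tree_a) + len(tree_b)
    if total == 0:
        return 0.0
    return len(tree_a ^ tree_b) / total


def topology_spread(trees):
    """Return (mean pairwise relative RF distance, fraction of unique topologies)."""
    pairs = list(combinations(trees, 2))
    mean_rf = sum(relative_rf(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    unique_fraction = len(set(trees)) / len(trees)
    return mean_rf, unique_fraction


# Three searches that all find the same topology (an "easy" toy case).
t1 = make_tree("AB", "DE")
t2 = make_tree("CDE", "ABC")  # same splits as t1, written from the other side
easy = [t1, t2, t1]

# Three searches that each find a different topology (a "hard" toy case).
t3 = make_tree("AC", "DE")
t4 = make_tree("AD", "BC")
hard = [t1, t3, t4]

easy_spread = topology_spread(easy)
hard_spread = topology_spread(hard)
print("easy:", easy_spread)
print("hard:", hard_spread)
```

In this toy setting, a low mean distance together with few unique topologies corresponds to searches that converge, while values near 1 correspond to searches that land on highly distinct trees. Whether those distinct trees are statistically indistinguishable would require likelihood-based tests, which this sketch does not attempt.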

Identifiers

pubmed: 36395091
pii: 6832260
doi: 10.1093/molbev/msac254
pmc: PMC9728795

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2022. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.

Authors

Julia Haag (J)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.

Dimitri Höhler (D)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.

Ben Bettisworth (B)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.

Alexandros Stamatakis (A)

Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany.
