Predicting runtimes of bioinformatics tools based on historical data: five years of Galaxy usage.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
2019-09-15
History:
received: 2018-08-15
revised: 2019-01-11
accepted: 2019-01-25
pubmed: 2019-01-31
medline: 2020-06-11
entrez: 2019-01-31
Status: ppublish

Abstract

One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation. Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions, and allow performance to be improved by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy that is comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation. Source code, implemented in Python, is available at https://github.com/atyryshkina/algorithm-performance-analysis. Supplementary data are available at Bioinformatics online.
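
The classifier-based approach mentioned in the abstract depends on turning continuous runtimes into a fixed set of walltime intervals, so that the width of each prediction interval is chosen in advance. The following is a minimal sketch of that labeling step only, not the authors' code: the bin edges, the helper names `runtime_to_bin` and `bin_to_walltime`, and the sample runtimes are all hypothetical and chosen purely for illustration.

```python
from bisect import bisect_left

# Hypothetical upper edges of the walltime bins, in seconds.
# These values are an assumption for illustration, not taken from the paper.
BIN_EDGES = [60, 600, 3600, 6 * 3600, 24 * 3600]


def runtime_to_bin(runtime_s: float) -> int:
    """Return the index of the first bin whose upper edge is >= runtime_s.

    Runtimes longer than the last edge fall into an overflow bin with
    index len(BIN_EDGES).
    """
    return bisect_left(BIN_EDGES, runtime_s)


def bin_to_walltime(bin_index: int):
    """Return the walltime (seconds) to request for a predicted bin.

    The overflow bin has no finite upper edge, so None is returned for it.
    """
    if bin_index < len(BIN_EDGES):
        return BIN_EDGES[bin_index]
    return None


# Made-up historical runtimes (seconds) converted to class labels that a
# classifier could then be trained on.
historical_runtimes = [12.0, 75.5, 600.0, 4000.0, 90000.0]
labels = [runtime_to_bin(r) for r in historical_runtimes]
```

A classifier trained on such labels would predict a bin index for an unseen job, and `bin_to_walltime` would translate that index back into a walltime request. Because the edges are fixed up front, the interval sizes are under the operator's control, which is the property the abstract contrasts with the variable intervals of quantile regression forests.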

Identifiers

pubmed: 30698642
pii: 5304359
doi: 10.1093/bioinformatics/btz054
pmc: PMC6931352

Publication types

Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

3453-3460

Grants

Agency: NIAID NIH HHS
ID: R01 AI134384
Country: United States
Agency: NHGRI NIH HHS
ID: U41 HG006620
Country: United States

Copyright information

© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

References

Curr Protoc Mol Biol. 2010 Jan;Chapter 19:Unit 19.10.1-21
pubmed: 20069535
Genome Biol. 2010;11(8):R86
pubmed: 20738864
Nucleic Acids Res. 2016 Jul 8;44(W1):W3-W10
pubmed: 27137889

Authors

Anastasia Tyryshkina (A)

Huck Institute of Life Sciences, Neuroscience Program, The Pennsylvania State University, University Park, USA.

Nate Coraor (N)

Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA.

Anton Nekrutenko (A)

Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, USA.
