Optimizer's dilemma: optimization strongly influences model selection in transcriptomic prediction.


Journal

Bioinformatics advances
ISSN: 2635-0041
Abbreviated title: Bioinform Adv
Country: England
NLM ID: 9918282081306676

Publication information

Publication date:
2024
History:
received: 24 07 2023
revised: 09 11 2023
accepted: 13 01 2024
medline: 29 1 2024
pubmed: 29 1 2024
entrez: 29 1 2024
Status: epublish

Abstract

Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, optimizers are not often noted. We applied two implementations of LASSO logistic regression available in Python's scikit-learn package, each using a different optimization approach (coordinate descent, implemented in the liblinear library, and stochastic gradient descent, or SGD), to predict mutation status and gene essentiality from gene expression across a variety of pan-cancer driver genes. For varying levels of regularization, we compared performance and model sparsity between optimizers. After model selection and tuning, we found that liblinear and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best for high model sparsities (more nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased. Given these tradeoffs, we believe that the choice of optimizer should be clearly reported as part of the model selection and validation process, to allow readers and reviewers to better understand the context in which results have been generated. The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance/regularization strength curves for all genes in the Vogelstein

Identifiers

pubmed: 38282973
doi: 10.1093/bioadv/vbae004
pii: vbae004
pmc: PMC10822580

Databases

figshare
10.6084/m9.figshare.22728644

Publication types

Journal Article

Languages

eng

Pagination

vbae004

Copyright information

© The Author(s) 2024. Published by Oxford University Press.

Conflict of interest statement

None declared.

Authors

Jake Crawford (J)

Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States.

Maria Chikina (M)

Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, United States.

Casey S Greene (CS)

Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States.
Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, United States.

MeSH classifications