Learning curves for drug response prediction in cancer cell lines.

Keywords: Cell line; Deep learning; Drug response prediction; Learning curve; Machine learning; Power law

Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
17 May 2021
History:
received: 29 Nov 2020
accepted: 04 May 2021
entrez: 18 May 2021
pubmed: 19 May 2021
medline: 20 May 2021
Status: epublish

Abstract

BACKGROUND
Motivated by the size and availability of cell line drug sensitivity data, researchers have been developing machine learning (ML) models that predict drug response to advance cancer treatment. As drug sensitivity studies continue to generate drug response data, a common question is whether the generalization performance of existing prediction models can be further improved with more training data.
METHODS
We use empirical learning curves to evaluate and compare the data scaling properties of two neural networks (NNs) and two gradient boosting decision tree (GBDT) models trained on four cell line drug screening datasets. A power law model fits the learning curves accurately, providing a framework for assessing the data scaling behavior of these models.
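As a rough illustration of fitting a learning curve to a power law, the sketch below estimates the parameters of error ≈ a · m^b by least squares in log-log space. The two-parameter form (with no constant offset term), the function name `fit_power_law`, and the synthetic data points are assumptions made for this example; they are not taken from the study and may differ from the model the authors fitted.

```python
import math

def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**b by ordinary least squares on log-transformed data."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(y_mean - b * x_mean)    # intercept in log-log space = log(a)
    return a, b

# Synthetic, noise-free learning curve (hypothetical values, not study data).
train_sizes = [1000, 2000, 4000, 8000, 16000, 32000]
true_a, true_b = 2.0, -0.3
val_errors = [true_a * m ** true_b for m in train_sizes]

a, b = fit_power_law(train_sizes, val_errors)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A negative exponent b means the error keeps falling as the training set grows. With real, noisy scores a nonlinear fit that includes an offset term may be more appropriate than this log-log regression.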
RESULTS
The curves demonstrate that no single model dominates in prediction performance across all datasets and training sizes, suggesting that the shape of each curve depends on the specific pairing of an ML model and a dataset. The multi-input NN (mNN), in which gene expression profiles of cancer cells and molecular drug descriptors are fed into separate subnetworks, outperforms a single-input NN (sNN), in which the cell and drug features are concatenated at the input layer. In contrast, a GBDT with hyperparameter tuning outperforms both NNs at the lower range of training set sizes for two of the tested datasets, whereas the mNN consistently performs better at the higher range of training sizes. Moreover, the trajectory of the curves suggests that increasing the sample size should further improve the prediction scores of both NNs. These observations demonstrate the benefit of using learning curves to evaluate prediction models, providing a broader perspective on their overall data scaling characteristics.
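The pattern described above, where one model leads at small training sizes and another at large ones, corresponds to two power law curves that cross. The sketch below computes that crossover size for two curves of the assumed form error = a · m^b. The function name `crossover_size` and all parameter values are hypothetical and chosen only to reproduce the qualitative pattern; they are not fitted values from the study.

```python
def crossover_size(a1, b1, a2, b2):
    """Training size m at which a1 * m**b1 equals a2 * m**b2."""
    if b1 == b2:
        raise ValueError("curves with equal exponents never cross")
    # a1 * m**b1 = a2 * m**b2  =>  m**(b1 - b2) = a2 / a1
    return (a2 / a1) ** (1.0 / (b1 - b2))

# Hypothetical parameters: the tree model starts lower but improves more slowly.
gbdt_a, gbdt_b = 1.0, -0.20
nn_a, nn_b = 3.0, -0.35

m_cross = crossover_size(gbdt_a, gbdt_b, nn_a, nn_b)

def gbdt_error(m):
    return gbdt_a * m ** gbdt_b

def nn_error(m):
    return nn_a * m ** nn_b

print(f"curves cross near m = {m_cross:.0f}")
print("GBDT better below crossover:", gbdt_error(m_cross / 10) < nn_error(m_cross / 10))
print("NN better above crossover:", nn_error(m_cross * 10) < gbdt_error(m_cross * 10))
```

Under these assumed parameters the model with the steeper (more negative) exponent overtakes the other once the training set exceeds the crossover size.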
CONCLUSIONS
A fitted power law learning curve provides a forward-looking metric for analyzing prediction performance and can serve as a co-design tool that guides experimental biologists and computational scientists in designing future experiments in prospective research studies.
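One way to read "forward-looking" is that a fitted curve can be extrapolated to sizes not yet collected, or inverted to estimate how much data a target score would need. The sketch below does both for the assumed two-parameter form error = a · m^b. The helper names and the values of `a_fit` and `b_fit` are hypothetical, and extrapolating far beyond the measured range is itself an assumption that the power law continues to hold.

```python
def predict_error(a, b, size):
    """Extrapolated error at a given training size under error = a * size**b."""
    return a * size ** b

def size_for_target(a, b, target_error):
    """Training size at which the fitted curve reaches target_error (requires b < 0)."""
    if b >= 0:
        raise ValueError("error does not decrease with size when b >= 0")
    if target_error <= 0:
        raise ValueError("target_error must be positive")
    # target = a * m**b  =>  m = (target / a) ** (1 / b)
    return (target_error / a) ** (1.0 / b)

# Hypothetical fitted parameters (not values reported in the study).
a_fit, b_fit = 2.0, -0.3

current_size = 32000
doubled = predict_error(a_fit, b_fit, 2 * current_size)
current = predict_error(a_fit, b_fit, current_size)
print(f"doubling the data scales the error by {doubled / current:.3f}")

needed = size_for_target(a_fit, b_fit, target_error=0.05)
print(f"samples needed for error 0.05: about {needed:.0f}")
```

Under this form, doubling the training set always multiplies the error by 2^b, which gives a quick estimate of the return on collecting more screening data.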

Identifiers

pubmed: 34001007
doi: 10.1186/s12859-021-04163-y
pii: 10.1186/s12859-021-04163-y
pmc: PMC8130157

Chemical substances

Pharmaceutical Preparations 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

252

Authors

Alexander Partin (A)

Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA. apartin@anl.gov.
University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA. apartin@anl.gov.

Thomas Brettin (T)

University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA.
Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL, USA.

Yvonne A Evrard (YA)

Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., Frederick, MD, USA.

Yitan Zhu (Y)

Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA.
University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA.

Hyunseung Yoo (H)

Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA.
University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA.

Fangfang Xia (F)

Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA.
University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA.

Songhao Jiang (S)

Department of Computer Science, University of Chicago, Chicago, IL, USA.

Austin Clyde (A)

Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA.
Department of Computer Science, University of Chicago, Chicago, IL, USA.

Maulik Shukla (M)

Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA.
University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA.

Michael Fonstein (M)

Biosciences Division, Argonne National Laboratory, Lemont, IL, USA.

James H Doroshow (JH)

Division of Cancer Therapeutics and Diagnosis, National Cancer Institute, Bethesda, MD, USA.

Rick L Stevens (RL)

Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL, USA.
Department of Computer Science, University of Chicago, Chicago, IL, USA.
