Single-center versus multi-center data sets for molecular prognostic modeling: a simulation study.
Feature selection
Omics data
Predictive model
Predictive performance
Study design
Validation
Journal
Radiation oncology (London, England)
ISSN: 1748-717X
Abbreviated title: Radiat Oncol
Country: England
NLM ID: 101265111
Publication information
Publication date: 14 May 2020
History:
received: 19 November 2019
accepted: 22 April 2020
entrez: 16 May 2020
pubmed: 16 May 2020
medline: 23 March 2021
Status: epublish
Abstract
BACKGROUND: Prognostic models based on high-dimensional omics data generated from clinical patient samples, such as tumor tissues or biopsies, are increasingly used for prognosis of radiotherapeutic success. The model development process requires two independent data sets, one for discovery and one for validation. Each of them may contain samples collected in a single center or a collection of samples from multiple centers. Multi-center data tend to be more heterogeneous than single-center data but are less affected by potential site-specific biases. Optimal use of limited data resources for discovery and validation with respect to the expected success of a study requires dispassionate, objective decision-making. In this work, we addressed the impact of the choice of single-center and multi-center data as discovery and validation data sets, and assessed how this impact depends on three data characteristics: signal strength, number of informative features and sample size.
METHODS: We set up a simulation study to quantify the predictive performance of a model trained and validated on different combinations of in silico single-center and multi-center data. The standard bioinformatic analysis workflow of batch correction, feature selection and parameter estimation was emulated. Four measures were used to determine model quality: false discovery rate, prediction error, chance of successful validation (significant correlation of predicted and true validation data outcome) and model calibration.
RESULTS: In agreement with the literature on generalizability of signatures, prognostic models fitted to multi-center data consistently outperformed their single-center counterparts when the prediction error was the quality criterion of interest. However, for low signal strengths and small sample sizes, single-center discovery sets showed superior performance with respect to false discovery rate and chance of successful validation.
CONCLUSIONS: With regard to decision making, this simulation study underlines the importance of defining study aims precisely a priori. Minimization of the prediction error requires multi-center discovery data, whereas single-center data are preferable with respect to false discovery rate and chance of successful validation when the expected signal or sample size is low. In contrast, the choice of validation data solely affects the quality of the estimator of the prediction error, which was more precise on multi-center validation data.
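The workflow named in the abstract (batch correction, feature selection, parameter estimation, then scoring on validation data) can be illustrated with a minimal toy sketch. It does not reproduce the study's simulation design or its results. Everything specific in it is an assumption made for illustration: the additive per-center offset as the batch effect, per-center mean-centering as the batch correction, top-k correlation ranking as the feature selection, marginal least-squares slopes as the parameter estimates, and all function names and numeric settings. The significance test behind "successful validation" and the calibration measure are omitted; only the raw correlation is reported.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def simulate(n_per_center, n_centers, beta, batch_sd, rng):
    """Simulate centers with an additive site offset (assumed batch model),
    then mean-center features within each center (toy batch correction)."""
    p = len(beta)
    X, y = [], []
    for _ in range(n_centers):
        offset = [rng.gauss(0.0, batch_sd) for _ in range(p)]
        rows = []
        for _ in range(n_per_center):
            z = [rng.gauss(0.0, 1.0) for _ in range(p)]  # latent signal
            y.append(sum(b * v for b, v in zip(beta, z)) + rng.gauss(0.0, 1.0))
            rows.append([v + o for v, o in zip(z, offset)])  # measured values
        means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        X.extend([[r[j] - means[j] for j in range(p)] for r in rows])
    return X, y

def fit(X, y, k):
    """Keep the k features most correlated with y and estimate one marginal
    least-squares slope per kept feature (a simplification, not a joint fit)."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    ranked = sorted(range(p), key=lambda j: -abs(pearson(cols[j], y)))
    my = sum(y) / len(y)
    slopes, intercept = {}, my
    for j in ranked[:k]:
        mx = sum(cols[j]) / len(cols[j])
        sxx = sum((x - mx) ** 2 for x in cols[j])
        sxy = sum((x - mx) * (t - my) for x, t in zip(cols[j], y))
        slopes[j] = sxy / sxx
        intercept -= slopes[j] * mx
    return intercept, slopes

def predict(model, X):
    """Linear predictor built from the selected features only."""
    intercept, slopes = model
    return [intercept + sum(s * row[j] for j, s in slopes.items()) for row in X]

def evaluate(model, X_val, y_val, informative):
    """False discovery rate, mean squared prediction error, and correlation."""
    pred = predict(model, X_val)
    selected = set(model[1])
    fdr = len(selected - informative) / len(selected)
    mse = sum((a - b) ** 2 for a, b in zip(pred, y_val)) / len(y_val)
    return {"fdr": fdr, "mse": mse, "corr": pearson(pred, y_val)}

def run(seed=0):
    """Compare a single-center and a three-center discovery set of equal size."""
    rng = random.Random(seed)
    p, n_informative, signal = 50, 5, 0.5  # illustrative settings only
    beta = [signal] * n_informative + [0.0] * (p - n_informative)
    informative = set(range(n_informative))
    X_val, y_val = simulate(20, 3, beta, 1.0, rng)  # multi-center validation
    results = {}
    for name, n_centers in (("single", 1), ("multi", 3)):
        X, y = simulate(60 // n_centers, n_centers, beta, 1.0, rng)
        results[name] = evaluate(fit(X, y, k=5), X_val, y_val, informative)
    return results

if __name__ == "__main__":
    for name, metrics in run().items():
        print(name, metrics)
```

Because the assumed batch effect is purely additive and is removed completely by mean-centering, this toy is not expected to show the single-center versus multi-center differences reported in the abstract; it only shows where each workflow step and quality measure would sit in such a simulation.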
Identifiers
pubmed: 32410693
doi: 10.1186/s13014-020-01543-1
pii: 10.1186/s13014-020-01543-1
pmc: PMC7227093
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
109