Synthetic data for privacy-preserving clinical risk prediction.


Journal

Scientific Reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
27 Oct 2024
History:
received: 16 Apr 2024
accepted: 11 Sep 2024
medline: 28 Oct 2024
pubmed: 28 Oct 2024
entrez: 28 Oct 2024
Status: epublish

Abstract

Synthetic data promise privacy-preserving data sharing for healthcare research and development. Compared with other privacy-enhancing approaches, such as federated learning, analyses performed on synthetic data can be applied downstream without modification, such that synthetic data can act in place of real data for a wide range of use cases. However, the role that synthetic data might play in all aspects of clinical model development remains unknown. In this work, we used state-of-the-art generators explicitly designed for privacy preservation to create a synthetic version of ever-smokers in the UK Biobank before building prognostic models for lung cancer under several data release assumptions. We demonstrate that synthetic data can be effectively used throughout the medical prognostic modeling pipeline even without eventual access to the real data. Furthermore, we show the implications of different data release approaches on how synthetic biobank data could be deployed within the healthcare system.

Identifiers

pubmed: 39463411
doi: 10.1038/s41598-024-72894-y
pii: 10.1038/s41598-024-72894-y

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

25676

Grants

Agency: Cancer Research UK
ID: EICEDAAP\100012
Country: United Kingdom

Copyright information

© 2024. The Author(s).

Authors

Zhaozhi Qian (Z)

University of Cambridge, Cambridge, CB2 1TN, UK.

Thomas Callender (T)

University College London, London, WC1E 6BT, UK.

Bogdan Cebere (B)

University of Cambridge, Cambridge, CB2 1TN, UK.

Sam M Janes (SM)

University College London, London, WC1E 6BT, UK.

Neal Navani (N)

University College London, London, WC1E 6BT, UK.

Mihaela van der Schaar (M)

University of Cambridge, Cambridge, CB2 1TN, UK. mv472@cam.ac.uk.
The Alan Turing Institute, London, NW1 2DB, UK. mv472@cam.ac.uk.
