Application of Bayesian networks to generate synthetic health data.
Bayesian networks
data dissemination
disclosure risk
health data
synthetic data
Journal
Journal of the American Medical Informatics Association : JAMIA
ISSN: 1527-974X
Abbreviated title: J Am Med Inform Assoc
Country: England
NLM ID: 9430800
Publication information
Publication date:
18 03 2021
History:
received: 17 08 2020
accepted: 16 11 2020
pubmed: 29 12 2020
medline: 24 08 2021
entrez: 28 12 2020
Status: ppublish
Abstract
This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medBGAN, in handling the complexity and dimensionality of healthcare data.
We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data.
Our Bayesian network model outperformed or equaled medBGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules.
Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools.
We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.
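The simulation step described in the abstract (drawing synthetic records from a learned network) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a small, hand-specified network rather than one learned from data, and the variable names, values, and counts in `REAL` and `PARENTS` are hypothetical toy inputs, not taken from the UCI or MIMIC-III datasets. Conditional probabilities are estimated by simple counting, and each record is drawn by ancestral sampling: every variable is sampled after its parents.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy records, invented for illustration only.
REAL = (
    [{"age": "old", "diabetes": "yes", "heart_disease": "yes"}] * 30
    + [{"age": "old", "diabetes": "yes", "heart_disease": "no"}] * 10
    + [{"age": "old", "diabetes": "no", "heart_disease": "no"}] * 20
    + [{"age": "young", "diabetes": "no", "heart_disease": "no"}] * 35
    + [{"age": "young", "diabetes": "no", "heart_disease": "yes"}] * 5
)

# Assumed structure (node -> list of parents). The paper learns the
# structure from data; here it is fixed by hand to keep the sketch short.
PARENTS = {"age": [], "diabetes": ["age"], "heart_disease": ["diabetes"]}


def fit(records, parents):
    """Count-based estimates of P(node | parents) plus marginal counts."""
    cpts = {node: defaultdict(Counter) for node in parents}
    marginals = {node: Counter() for node in parents}
    for rec in records:
        for node, pars in parents.items():
            key = tuple(rec[p] for p in pars)
            cpts[node][key][rec[node]] += 1
            marginals[node][rec[node]] += 1
    return cpts, marginals


def topological_order(parents):
    """Order nodes so that every parent precedes its children (DAG assumed)."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        for p in parents[node]:
            visit(p)
        seen.add(node)
        order.append(node)

    for node in parents:
        visit(node)
    return order


def sample(cpts, marginals, parents, n, seed=0):
    """Ancestral sampling: draw each node given its already-sampled parents."""
    rng = random.Random(seed)
    order = topological_order(parents)
    out = []
    for _ in range(n):
        rec = {}
        for node in order:
            key = tuple(rec[p] for p in parents[node])
            # Fall back to the marginal if this parent combination was never seen.
            counts = cpts[node].get(key) or marginals[node]
            values, weights = zip(*counts.items())
            rec[node] = rng.choices(values, weights=weights)[0]
        out.append(rec)
    return out


cpts, marginals = fit(REAL, PARENTS)
synthetic = sample(cpts, marginals, PARENTS, n=200)
```

Because the estimates here are unsmoothed counts, a parent and child value pair that never occurs in the toy data cannot occur in the synthetic records either. The evaluation steps named in the abstract (statistical tests, disclosure risk, real-versus-synthetic classification) are outside the scope of this sketch.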
Identifiers
pubmed: 33367620
pii: 6046159
doi: 10.1093/jamia/ocaa303
pmc: PMC7973486
Databases
Dryad
10.5061/dryad.ttdz08kws
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
801-811
Copyright information
© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.