Harmonising electronic health records for reproducible research: challenges, solutions and recommendations from a UK-wide COVID-19 research collaboration.
Keywords
COVID-19
Common data model
Data harmonisation
Electronic health record
NHS Digital TRE for England
Population health
Reproducible research
SAIL Databank
Trusted Research Environments
Journal
BMC medical informatics and decision making
ISSN: 1472-6947
Abbreviated title: BMC Med Inform Decis Mak
Country: England
NLM ID: 101088682
Publication information
Publication date: 2023-01-16
History:
received: 2022-09-27
accepted: 2022-12-21
entrez: 2023-01-16
pubmed: 2023-01-17
medline: 2023-01-19
Status:
epublish
Abstract
BACKGROUND
The CVD-COVID-UK consortium was formed to understand the relationship between COVID-19 and cardiovascular diseases through analyses of harmonised electronic health records (EHRs) across the four UK nations. Beyond COVID-19, data harmonisation and common approaches enable analysis within and across independent Trusted Research Environments. Here we describe the reproducible harmonisation method developed using large-scale EHRs in Wales to accommodate the fast and efficient implementation of cross-nation analysis in England and Wales as part of the CVD-COVID-UK programme. We characterise current challenges and share lessons learnt.
METHODS
Serving the scope and scalability of multiple study protocols, we used linked, anonymised individual-level EHR, demographic and administrative data held within the SAIL Databank for the population of Wales. The harmonisation method was implemented as a four-layer reproducible process, starting from raw data in the first layer. Then each of the layers two to four is framed by, but not limited to, the characterised challenges and lessons learnt. We achieved curated data as part of our second layer, followed by extracting phenotyped data in the third layer. We captured any project-specific requirements in the fourth layer.
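The four layers described above (raw, curated, phenotyped, project-specific) can be illustrated with a minimal Python sketch. This is not the consortium's implementation: the record fields, the cleaning rules, the single-code codelist and the study start date are all hypothetical, chosen only to show how each layer can be derived reproducibly from the one before it.

```python
from dataclasses import dataclass
from datetime import date

# All field names, codes and rules below are hypothetical illustrations,
# not the consortium's actual schema or phenotype definitions.


@dataclass(frozen=True)
class RawRecord:
    """Layer 1: a row as received from a data source."""
    person_id: str
    event_date: str  # expected ISO format, but may be malformed
    code: str        # clinical code, may carry stray whitespace or case


@dataclass(frozen=True)
class CuratedRecord:
    """Layer 2: a cleaned, typed row."""
    person_id: str
    event_date: date
    code: str


def curate(raw):
    """Layer 2: drop unparseable dates, normalise codes, remove duplicates."""
    seen = set()
    out = []
    for r in raw:
        try:
            parsed = date.fromisoformat(r.event_date.strip())
        except ValueError:
            continue  # skip rows whose date cannot be parsed
        rec = CuratedRecord(r.person_id.strip(), parsed, r.code.strip().upper())
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out


def phenotype(curated, codelist):
    """Layer 3: earliest event date per person for any code in the codelist."""
    first = {}
    for rec in curated:
        if rec.code in codelist:
            if rec.person_id not in first or rec.event_date < first[rec.person_id]:
                first[rec.person_id] = rec.event_date
    return first


def project_view(first_dates, study_start):
    """Layer 4: project-specific flag, True if the phenotype predates study start."""
    return {pid: d < study_start for pid, d in first_dates.items()}


# Illustrative input: one duplicate row, one malformed date, one unrelated code.
raw_rows = [
    RawRecord("p1", "2019-05-01", " i21 "),
    RawRecord("p1", "2019-05-01", "I21"),
    RawRecord("p2", "not-a-date", "I21"),
    RawRecord("p2", "2020-06-10", "I21"),
    RawRecord("p3", "2018-01-01", "J45"),
]
curated = curate(raw_rows)
first_event = phenotype(curated, codelist={"I21"})
flags = project_view(first_event, study_start=date(2020, 1, 1))
```

Because each layer is a pure function of the previous one, rerunning the chain after a source refresh regenerates every downstream layer, which is one plausible reading of what makes such a layered design reproducible.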
RESULTS
Using the four-layer harmonisation method, we retrieved approximately 100 health-related variables for the 3.2 million individuals in Wales, harmonised with corresponding variables for more than 56 million individuals in England. We processed 13 data sources into the first layer of our harmonisation method: five of these are updated daily or weekly, and the rest at various frequencies, providing a data flow sufficient for frequent capture of up-to-date demographic, administrative and clinical information.
CONCLUSIONS
We implemented an efficient, transparent, scalable, and reproducible harmonisation method that enables multi-nation collaborative research. With a current focus on COVID-19 and its relationship with cardiovascular outcomes, the harmonised data has supported a wide range of research activities across the UK.
Identifiers
pubmed: 36647111
doi: 10.1186/s12911-022-02093-0
pii: 10.1186/s12911-022-02093-0
pmc: PMC9842203
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
8
Grants
Agency: The British Heart Foundation Data Science Centre
ID: SP/19/3/34678
Agency: Medical Research Council
ID: MR/V028367/1
Country: United Kingdom
Agency: Administrative Data Research Wales
ID: ES/S007393/1
Agency: Health Data Research UK
ID: HDR-9006
Copyright information
© 2023. The Author(s).