Linking Biomedical Data Warehouse Records With the National Mortality Database in France: Large-scale Matching Algorithm.

Keywords: French National Mortality Database; clinical data warehouse; clinical informatics; data reuse; data warehousing; medical informatics applications; medical record linkage; open data; R

Journal

JMIR medical informatics
ISSN: 2291-9694
Abbreviated title: JMIR Med Inform
Country: Canada
NLM ID: 101645109

Publication information

Publication date:
01 Nov 2022
History:
received: 21 Jan 2022
accepted: 11 Apr 2022
revised: 04 Apr 2022
entrez: 01 Nov 2022
pubmed: 02 Nov 2022
medline: 02 Nov 2022
Status: epublish

Abstract

BACKGROUND
Vital status after discharge is often missing from or uncertain in a biomedical data warehouse (BDW), yet it is central to the value of a BDW in medical research. The French National Mortality Database (FNMD) offers open-source nominative records of every death. Matching large-scale BDW records with the FNMD combines multiple challenges: the absence of unique common identifiers between the 2 databases, names that change over a lifetime, clerical errors, and the exponential growth of the number of comparisons to compute.
OBJECTIVE
We aimed to develop a new algorithm for matching BDW records to the FNMD and to evaluate its performance.
METHODS
We developed a deterministic algorithm based on advanced data cleaning, knowledge of the naming system, and the Damerau-Levenshtein distance (DLD). The algorithm's performance was independently assessed using BDW data from 3 university hospitals: Lille, Nantes, and Rennes. Specificity was evaluated with patients known to be alive on January 1, 2016 (ie, patients with at least 1 hospital encounter before and after this date). Sensitivity was evaluated with patients recorded as deceased between January 1, 2001, and December 31, 2020. The DLD-based algorithm was compared to a reference direct matching algorithm with minimal data cleaning.
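The idea of combining name cleaning with an edit distance can be illustrated with a minimal Python sketch. It is not the authors' Inseehop R implementation: the cleaning steps, the field names (`last_name`, `first_name`, `birth_date`, `sex`), the matching rule, and the `max_distance` threshold are all assumptions made for illustration. The distance shown is the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and transpositions of adjacent characters.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Illustrative cleaning: strip accents, uppercase, keep letters only."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_accents.upper() if c.isalpha())


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def is_match(bdw: dict, fnmd: dict, max_distance: int = 1) -> bool:
    """Hypothetical rule: identical birth date and sex, names within max_distance."""
    if bdw["birth_date"] != fnmd["birth_date"] or bdw["sex"] != fnmd["sex"]:
        return False
    for field in ("last_name", "first_name"):
        distance = osa_distance(normalize_name(bdw[field]), normalize_name(fnmd[field]))
        if distance > max_distance:
            return False
    return True
```

With this rule, a warehouse record for "Lefèvre, Jean" would match a death record for "LEFEVER, JEAN" with the same birth date and sex, because the cleaned last names differ by a single adjacent transposition. A direct (exact) comparison of the raw strings would miss that pair.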
RESULTS
All centers combined, sensitivity was approximately 11 percentage points higher for the DLD-based algorithm (93.3%, 95% CI 92.8-93.9) than for the direct algorithm (82.7%, 95% CI 81.8-83.6; P<.001). Sensitivity was superior for men at 2 centers (Nantes: 87.0%, 95% CI 85.1-89.0 vs 83.6%, 95% CI 81.4-85.8; P=.006; Rennes: 98.6%, 95% CI 98.1-99.2 vs 96.0%, 95% CI 94.9-97.1; P<.001) and for patients born in France at all centers (Nantes: 85.8%, 95% CI 84.3-87.3 vs 74.9%, 95% CI 72.8-77.0; P<.001). The DLD-based algorithm revealed significant differences in sensitivity among centers (Nantes, 85.3% vs Lille and Rennes, 97.3%; P<.001). Specificity was >98% in all subgroups. Our algorithm matched tens of millions of death records from BDWs, with parallel computing capabilities and low RAM requirements. We used the Inseehop open-source R script for this measurement.
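For readers who want to reproduce this kind of figure on their own linkage results, the following Python sketch computes sensitivity and specificity with a 95% confidence interval. The abstract does not state which interval method the authors used; the normal-approximation (Wald) interval below is an assumption, and the counts in the usage note are hypothetical, not taken from the study.

```python
import math


def proportion_with_wald_ci(successes: int, total: int, z: float = 1.96):
    """Return (proportion, lower, upper) using a Wald interval (assumed method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clip to [0, 1] because the Wald interval can overshoot near the bounds
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def sensitivity(true_positives: int, false_negatives: int):
    """Share of known-deceased patients that the algorithm linked to a death record."""
    return proportion_with_wald_ci(true_positives, true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int):
    """Share of known-living patients that the algorithm left unlinked."""
    return proportion_with_wald_ci(true_negatives, true_negatives + false_positives)
```

As a usage example with hypothetical counts, `sensitivity(90, 10)` gives a proportion of 0.9 with an interval of roughly 0.841 to 0.959. Note also that the two pooled sensitivities reported above, 93.3% and 82.7%, differ by 10.6 percentage points, which is an absolute rather than a relative difference.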
CONCLUSIONS
Overall, sensitivity (recall) was approximately 11 percentage points higher with the DLD-based algorithm than with the direct algorithm. This shows the importance of advanced data cleaning and of encoding knowledge of a naming system through the use of the DLD. Statistically significant differences in sensitivity between groups were found and must be considered when performing an analysis to avoid differential biases. Our algorithm, originally conceived for linking a BDW with the FNMD, can be used to match any large-scale databases. Although matching operations that use names are considered sensitive computational operations, the Inseehop package released here is easy to run on premises, thereby facilitating compliance with local cybersecurity frameworks. The use of an advanced deterministic matching algorithm such as the DLD-based algorithm is an insightful example of combining open-source external data to improve the usage value of BDWs.

Identifiers

pubmed: 36318244
pii: v10i11e36711
doi: 10.2196/36711
pmc: PMC9667378

Publication types

Journal Article

Languages

eng

Pagination

e36711

Copyright information

©Vianney Guardiolle, Adrien Bazoge, Emmanuel Morin, Béatrice Daille, Delphine Toublant, Guillaume Bouzillé, Youenn Merel, Morgane Pierre-Jean, Alexandre Filiot, Marc Cuggia, Matthieu Wargny, Antoine Lamer, Pierre-Antoine Gourraud. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 01.11.2022.

Authors

Vianney Guardiolle (V)

CHU de Nantes, INSERM CIC 1413, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, 44000, Nantes, France.

Adrien Bazoge (A)

CHU de Nantes, INSERM CIC 1413, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, 44000, Nantes, France.
LS2N UMR CNRS 6004, Université de Nantes - 2, rue de la Houssinière - BP 92208 - 44322 Nantes Cedex 03 - France, Nantes, France.

Emmanuel Morin (E)

LS2N UMR CNRS 6004, Université de Nantes - 2, rue de la Houssinière - BP 92208 - 44322 Nantes Cedex 03 - France, Nantes, France.

Béatrice Daille (B)

LS2N UMR CNRS 6004, Université de Nantes - 2, rue de la Houssinière - BP 92208 - 44322 Nantes Cedex 03 - France, Nantes, France.

Delphine Toublant (D)

CHU de Nantes, INSERM CIC 1413, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, 44000, Nantes, France.

Guillaume Bouzillé (G)

Univ Rennes, CHU Rennes, INSERM, LTSI-UMR 1099, 35000, Rennes, France.

Youenn Merel (Y)

Univ Rennes, CHU Rennes, INSERM, LTSI-UMR 1099, 35000, Rennes, France.

Morgane Pierre-Jean (M)

Univ Rennes, CHU Rennes, INSERM, LTSI-UMR 1099, 35000, Rennes, France.

Alexandre Filiot (A)

CHU Lille, INCLUDE: Integration Center of the Lille University Hospital for Data Exploration, 59000, Lille, France.

Marc Cuggia (M)

Univ Rennes, CHU Rennes, INSERM, LTSI-UMR 1099, 35000, Rennes, France.

Matthieu Wargny (M)

CHU de Nantes, INSERM CIC 1413, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, 44000, Nantes, France.

Antoine Lamer (A)

Univ. Lille, CHU Lille, ULR 2694, METRICS: Évaluation des Technologies de santé et des Pratiques médicales, F-59000, Lille, France.

Pierre-Antoine Gourraud (PA)

CHU de Nantes, INSERM CIC 1413, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, 44000, Nantes, France.
Université de Nantes, CHU de Nantes, INSERM, Centre de Recherche en Transplantation et Immunologie, UMR 1064, ATIP-Avenir, Nantes, France.

MeSH classifications