Assessment of the Consistency of Categorical Features Within the DZHK Biobanking Basic Set.

Biological specimen bank; Data quality; Metadata

Journal

Studies in health technology and informatics
ISSN: 1879-8365
Abbreviated title: Stud Health Technol Inform
Country: Netherlands
NLM ID: 9214582

Publication information

Publication date:
17 Aug 2022
History:
entrez: 8 Sep 2022
pubmed: 9 Sep 2022
medline: 11 Sep 2022
Status: ppublish

Abstract

Data quality in health research encompasses a broad range of aspects and indicators. While some indicators are generic and can be calculated without domain knowledge, others require information about a specific data element. Even more complex are indicators addressing contradictions that stem from implausible combinations of multiple data elements. In this paper, we investigate how contradictions within interdependent categorical data can be identified and whether they give additional information about possible quality issues, their causes, and mitigation options. The 19 data elements that represent four biosample types, including their pre-analytic states, within the DZHK Biobanking basic set are exported to the CDISC Operational Data Model (ODM), transformed, and loaded into a tranSMART instance. Through the implementation of a data quality assessment workflow as a SmartR plug-in, statistical information about the domain-specific consistency of interdependent values is retrieved, assessed, and visualized. Data quality indicators were selected for the assessment according to common recommendations found in the literature. Several contradictions were discovered in the dataset, including mismatches of interdependent values in the pre-analytic states of blood and urine samples, as well as of primary and aliquoted samples. The overall assessment rating shows that 99.61% of the interdependent values are free of contradictions. However, measures within the electronic data capture (EDC) design to avoid contradictions may result in overestimated missing rates in automatic, item-based quality assessment checks. Through consistency checks on interdependent categorical features, we demonstrated that consistency flaws can be found in the categorical data of biobanking metadata and that they can help to detect issues in the data entry process.
Our approach underscores the importance of domain knowledge in defining the consistency rules, and of knowledge about how the EDC system implements such rules, so that their impact on item-based quality indicators can be taken into account.
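
The kind of rule-based consistency check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' SmartR workflow: the field names (`blood_collected`, `blood_aliquot_count`, `urine_collected`, `urine_preanalytic_state`), the two rules, and the way the contradiction-free percentage is computed (one check per record and rule) are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A record maps a (hypothetical) categorical data element to its value;
# None stands for a missing entry.
Record = Dict[str, Optional[str]]
Rule = Callable[[Record], bool]  # returns True when the record is contradictory


def blood_aliquot_without_primary(rec: Record) -> bool:
    # Assumed rule: aliquots are recorded although no primary blood sample was collected.
    return rec.get("blood_collected") == "no" and rec.get("blood_aliquot_count") is not None


def urine_state_without_sample(rec: Record) -> bool:
    # Assumed rule: a pre-analytic state is recorded although no urine sample was collected.
    return rec.get("urine_collected") == "no" and rec.get("urine_preanalytic_state") is not None


RULES: Dict[str, Rule] = {
    "blood_aliquot_without_primary": blood_aliquot_without_primary,
    "urine_state_without_sample": urine_state_without_sample,
}


def assess(records: List[Record], rules: Dict[str, Rule]) -> Tuple[Dict[str, int], float]:
    """Count contradictions per rule and return the percentage of checks without one."""
    counts = {name: 0 for name in rules}
    checks = 0
    for rec in records:
        for name, rule in rules.items():
            checks += 1
            if rule(rec):
                counts[name] += 1
    contradictions = sum(counts.values())
    free_pct = 100.0 * (checks - contradictions) / checks if checks else 100.0
    return counts, free_pct
```

Calling `assess(records, RULES)` on a list of such records yields a per-rule contradiction count and an overall percentage. Keeping the rules as separate named functions reflects the point made above: each rule encodes domain knowledge about which value combinations are implausible.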

Identifiers

pubmed: 36073494
pii: SHTI220809
doi: 10.3233/SHTI220809

Publication types

Journal Article

Languages

eng

Pagination

98-106

Authors

Khalid Yusuf (K)

Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany.

Kais Tahar (K)

Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany.

Ulrich Sax (U)

Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany.
Campus Institute Data Science (CIDAS), Georg August-University, Göttingen, Germany.

Wolfgang Hoffmann (W)

DZHK (German Centre for Cardiovascular Research).
Institute for Community Medicine, Department Epidemiology of Health Care and Community Health, University Medicine Greifswald, Germany.

Dagmar Krefting (D)

Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany.
Campus Institute Data Science (CIDAS), Georg August-University, Göttingen, Germany.
DZHK (German Centre for Cardiovascular Research).


MeSH classifications