Integrating statistical and visual analytic methods for bot identification of health-related survey data.
Bot identification
COVID-19
Data quality
Interactive visual analytics
Questionnaire data
Visual data analysis
Journal
Journal of biomedical informatics
ISSN: 1532-0480
Titre abrégé: J Biomed Inform
Pays: United States
ID NLM: 100970413
Informations de publication
Date de publication:
08 2023
08 2023
Historique:
received:
17
01
2023
revised:
02
07
2023
accepted:
04
07
2023
medline:
11
8
2023
pubmed:
8
7
2023
entrez:
7
7
2023
Statut:
ppublish
Résumé
In recent years, we have increasingly observed issues concerning quality of online information due to misinformation and disinformation. Aside from social media, there is growing awareness that questionnaire data collected using online recruitment methods may include suspect data provided by bots. Issues with data quality can be particularly problematic in health and/or biomedical contexts; thus, developing robust methods for suspect data identification and removal is of paramount importance in informatics. In this study, we describe an interactive visual analytics approach to suspect data identification and removal and demonstrate the application of this approach on questionnaire data pertaining to COVID-19 derived from different recruitment venues, including listservs and social media. We developed a pipeline for data cleaning, pre-processing, analysis, and automated ranking of data to address data quality issues. We then employed the ranking in conjunction with manual review to identify suspect data and remove them from subsequent analyses. Last, we compared differences in the data before and after removal. We performed data cleaning, pre-processing, and exploratory analysis on a survey dataset (N = 4,163) collected using multiple recruitment mechanins using the Qualtrics survey platform. Based on these results, we identified suspect features and used these to generate a suspect feature indicator for each survey response. We excluded survey responses that did not fit the inclusion criteria for the study (n = 29) and then performed manual review of the remaining responses, triangulating with the suspect feature indicator. Based on this review, we excluded 2,921 responses. Additional responses were excluded based on a spam classification by Qualtrics (n=13), and the percentage of survey completion (n=328), resulting in a final sample size of 872. We performed additional analyses to demonstrate the extent to which the suspect feature indicator was congruent with eventual inclusion, as well as compared the characteristics of the included and excluded data. Our main contributions are: 1) a proposed framework for data quality assessment, including suspect data identification and removal; 2) the analysis of potential consequences in terms of representation bias in the dataset; and 3) recommendations for implementation of this approach in practice.
Identifiants
pubmed: 37419375
pii: S1532-0464(23)00160-0
doi: 10.1016/j.jbi.2023.104439
pii:
doi:
Types de publication
Review
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Pagination
104439Informations de copyright
Copyright © 2023 Elsevier Inc. All rights reserved.
Déclaration de conflit d'intérêts
Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.