Integrating statistical and visual analytic methods for bot identification of health-related survey data.

MeSH terms: Humans; COVID-19 / epidemiology; Surveys and Questionnaires; Software; Data Accuracy; Social Media

Keywords: Bot identification; COVID-19; Data quality; Interactive visual analytics; Questionnaire data; Visual data analysis

Journal

Journal of biomedical informatics

ISSN: 1532-0480

Abbreviated title: J Biomed Inform

Country: United States

NLM ID: 100970413

Publication information

Publication date:
08 2023

History:

received: 17 01 2023

revised: 02 07 2023

accepted: 04 07 2023

medline: 11 8 2023

pubmed: 8 7 2023

entrez: 7 7 2023

Status: ppublish

Abstract

In recent years, we have increasingly observed issues concerning the quality of online information due to misinformation and disinformation. Aside from social media, there is growing awareness that questionnaire data collected using online recruitment methods may include suspect data provided by bots. Issues with data quality can be particularly problematic in health and/or biomedical contexts; thus, developing robust methods for suspect data identification and removal is of paramount importance in informatics. In this study, we describe an interactive visual analytics approach to suspect data identification and removal and demonstrate the application of this approach on questionnaire data pertaining to COVID-19 derived from different recruitment venues, including listservs and social media. We developed a pipeline for data cleaning, pre-processing, analysis, and automated ranking of data to address data quality issues. We then employed the ranking in conjunction with manual review to identify suspect data and remove them from subsequent analyses. Last, we compared differences in the data before and after removal. We performed data cleaning, pre-processing, and exploratory analysis on a survey dataset (N = 4,163) collected through multiple recruitment mechanisms using the Qualtrics survey platform. Based on these results, we identified suspect features and used these to generate a suspect feature indicator for each survey response. We excluded survey responses that did not fit the inclusion criteria for the study (n = 29) and then performed manual review of the remaining responses, triangulating with the suspect feature indicator. Based on this review, we excluded 2,921 responses. Additional responses were excluded based on a spam classification by Qualtrics (n = 13) and the percentage of survey completion (n = 328), resulting in a final sample size of 872.
We performed additional analyses to demonstrate the extent to which the suspect feature indicator was congruent with eventual inclusion, and we compared the characteristics of the included and excluded data. Our main contributions are: 1) a proposed framework for data quality assessment, including suspect data identification and removal; 2) the analysis of potential consequences in terms of representation bias in the dataset; and 3) recommendations for implementation of this approach in practice.
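
The two mechanical steps described in the abstract, scoring each response with a suspect feature indicator and then applying exclusions in sequence, can be illustrated with a minimal Python sketch. The abstract does not say which suspect features the authors used or how the indicator was computed, so everything on the scoring side is an assumption made for illustration: the `Response` fields, the three flags, and the 120-second threshold are hypothetical. Only the counts in `REPORTED_EXCLUSIONS` and the starting total of 4,163 come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One survey response. All fields are hypothetical examples."""
    response_id: str
    duration_seconds: float  # time taken to complete the survey
    duplicate_ip: bool       # same IP address seen on another response
    free_text: str           # answer to an open-ended question


def suspect_flags(r, min_duration=120.0):
    """Return one boolean per (assumed) suspect feature."""
    return {
        "too_fast": r.duration_seconds < min_duration,
        "duplicate_ip": r.duplicate_ip,
        "empty_free_text": not r.free_text.strip(),
    }


def suspect_indicator(r):
    """Count the suspect features a response triggers (assumed scoring rule)."""
    return sum(suspect_flags(r).values())


def rank_for_review(responses):
    """Order responses so the most suspect ones reach manual review first."""
    return sorted(responses, key=suspect_indicator, reverse=True)


def exclusion_flow(total, exclusions):
    """Apply exclusions in order and track how many responses remain."""
    remaining = total
    flow = []
    for reason, n in exclusions:
        remaining -= n
        flow.append((reason, n, remaining))
    return flow


# Counts as reported in the abstract, in the order they are described.
REPORTED_EXCLUSIONS = [
    ("did not fit inclusion criteria", 29),
    ("excluded after manual review", 2921),
    ("Qualtrics spam classification", 13),
    ("percentage of survey completion", 328),
]

flow = exclusion_flow(4163, REPORTED_EXCLUSIONS)
for reason, n, remaining in flow:
    print(f"{reason}: -{n} -> {remaining} remaining")
```

Running the flow on the reported counts ends at 872 remaining responses, which matches the final sample size stated in the abstract. In this sketch the ranking only prioritizes the manual review and never excludes a response on its own, in line with the abstract's description of the indicator as something used in conjunction with manual review.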

Identifiers

DOI: 10.1016/j.jbi.2023.104439 PMID: 37419375

pubmed: 37419375

pii: S1532-0464(23)00160-0

doi: 10.1016/j.jbi.2023.104439

Publication types

Review; Journal Article

Languages

eng

Citation subsets

Pagination

104439

Copyright information

Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Annie T Chen (AT)

Midori Komi (M)

Sierrah Bessler (S)

Sean P Mikles (SP)

Yan Zhang (Y)
