Integrating statistical and visual analytic methods for bot identification of health-related survey data.

Bot identification COVID-19 Data quality Interactive visual analytics Questionnaire data Visual data analysis

Journal

Journal of biomedical informatics
ISSN: 1532-0480
Abbreviated title: J Biomed Inform
Country: United States
NLM ID: 100970413

Publication information

Publication date:
08 2023
History:
received: 17 01 2023
revised: 02 07 2023
accepted: 04 07 2023
medline: 11 8 2023
pubmed: 8 7 2023
entrez: 7 7 2023
Status: ppublish

Abstract

In recent years, we have increasingly observed issues concerning the quality of online information due to misinformation and disinformation. Aside from social media, there is growing awareness that questionnaire data collected using online recruitment methods may include suspect data provided by bots. Issues with data quality can be particularly problematic in health and/or biomedical contexts; thus, developing robust methods for suspect data identification and removal is of paramount importance in informatics. In this study, we describe an interactive visual analytics approach to suspect data identification and removal and demonstrate the application of this approach on questionnaire data pertaining to COVID-19 derived from different recruitment venues, including listservs and social media. We developed a pipeline for data cleaning, pre-processing, analysis, and automated ranking of data to address data quality issues. We then employed the ranking in conjunction with manual review to identify suspect data and remove them from subsequent analyses. Last, we compared differences in the data before and after removal. We performed data cleaning, pre-processing, and exploratory analysis on a survey dataset (N = 4,163) collected through multiple recruitment mechanisms via the Qualtrics survey platform. Based on these results, we identified suspect features and used these to generate a suspect feature indicator for each survey response. We excluded survey responses that did not fit the inclusion criteria for the study (n = 29) and then performed manual review of the remaining responses, triangulating with the suspect feature indicator. Based on this review, we excluded 2,921 responses. Additional responses were excluded based on a spam classification by Qualtrics (n = 13) and the percentage of survey completion (n = 328), resulting in a final sample size of 872.
We performed additional analyses to demonstrate the extent to which the suspect feature indicator was congruent with eventual inclusion, and compared the characteristics of the included and excluded data. Our main contributions are: 1) a proposed framework for data quality assessment, including suspect data identification and removal; 2) the analysis of potential consequences in terms of representation bias in the dataset; and 3) recommendations for implementation of this approach in practice.
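
The exclusion counts reported in the abstract can be checked with a short sketch, which also illustrates one possible shape for a per-response suspect feature indicator. The counts come from the abstract; everything else is an assumption for illustration only. The abstract does not say which features were used or how they were combined, so the Response fields, the min_duration threshold, and the simple flag count below are hypothetical and should not be read as the authors' method.

```python
from dataclasses import dataclass

# Counts reported in the abstract.
TOTAL_RESPONSES = 4163
EXCLUSIONS = [
    ("did not meet inclusion criteria", 29),
    ("excluded after manual review", 2921),
    ("classified as spam by Qualtrics", 13),
    ("insufficient survey completion", 328),
]


def sample_flow(total, exclusions):
    """Return (reason, remaining) pairs after each sequential exclusion."""
    remaining = total
    flow = []
    for reason, count in exclusions:
        remaining -= count
        flow.append((reason, remaining))
    return flow


@dataclass
class Response:
    """Hypothetical survey response; these fields are NOT from the paper."""
    duration_seconds: float
    duplicate_ip: bool
    free_text: str


def suspect_indicator(response, min_duration=120.0):
    """Count how many hypothetical suspect features a response exhibits.

    Assumption: the indicator is a plain count of boolean flags. The
    features and the 120-second threshold are illustrative only.
    """
    flags = [
        response.duration_seconds < min_duration,  # implausibly fast
        response.duplicate_ip,                     # shared IP address
        len(response.free_text.strip()) == 0,      # empty free-text answer
    ]
    return sum(flags)


if __name__ == "__main__":
    for reason, remaining in sample_flow(TOTAL_RESPONSES, EXCLUSIONS):
        print(f"{reason}: {remaining} remaining")
    print(suspect_indicator(Response(30.0, True, "")))
```

Ranking responses by such a score before manual review would match the workflow the abstract describes, in which the indicator informs the reviewer's decision but does not replace it.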

Identifiers

pubmed: 37419375
pii: S1532-0464(23)00160-0
doi: 10.1016/j.jbi.2023.104439

Publication types

Review Journal Article

Languages

eng

Citation subsets

IM

Pagination

104439

Copyright information

Copyright © 2023 Elsevier Inc. All rights reserved.

Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Annie T Chen (AT)

Department of Biomedical Informatics and Medical Education, University of Washington School of Medicine, 850 Republican St., Box 358047, Seattle, WA 98195, United States. Electronic address: atchen@uw.edu.

Midori Komi (M)

University of Washington, Department of Mathematics, Box 354350, Seattle, WA 98195-4350, United States.

Sierrah Bessler (S)

University of Washington, Department of Applied Mathematics, 4182 W Stevens Way NE, Seattle, WA 98105, United States. Electronic address: sbessl@uw.edu.

Sean P Mikles (SP)

Lineberger Comprehensive Cancer Outcomes Program, Lineberger Comprehensive Cancer Center, UNC School of Medicine, 450 West Drive, Chapel Hill, NC 27514, United States.

Yan Zhang (Y)

School of Information, The University of Texas at Austin, 1616 Guadalupe Suite #5.202, Austin, TX 78701-1213, United States. Electronic address: yanz@utexas.edu.


MeSH classifications