Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study.


Journal

JMIR public health and surveillance
ISSN: 2369-2960
Abbreviated title: JMIR Public Health Surveill
Country: Canada
NLM ID: 101669345

Publication information

Publication date:
2020-06-08
History:
received: 2020-04-21
accepted: 2020-06-03
revised: 2020-06-02
pubmed: 2020-06-04
medline: 2020-06-17
entrez: 2020-06-04
Status: epublish

Abstract

BACKGROUND
The coronavirus disease (COVID-19) pandemic is a global health emergency, with over 6 million cases worldwide as of the beginning of June 2020. The pandemic is historic in scope and without precedent, given its emergence in an increasingly digital era. Importantly, there have been concerns about the accuracy of COVID-19 case counts due to issues such as lack of access to testing and difficulty in measuring recoveries.
OBJECTIVE
The aims of this study were to detect and characterize user-generated conversations that could be associated with COVID-19-related symptoms, experiences with access to testing, and mentions of disease recovery using an unsupervised machine learning approach.
METHODS
Tweets were collected from the Twitter public streaming application programming interface from March 3-20, 2020, filtered for general COVID-19-related keywords and then further filtered for terms that could be related to COVID-19 symptoms as self-reported by users. Tweets were analyzed using an unsupervised machine learning approach called the biterm topic model (BTM), where groups of tweets containing the same word-related themes were separated into topic clusters that included conversations about symptoms, testing, and recovery. Tweets in these clusters were then extracted and manually annotated for content analysis and assessed for their statistical and geographic characteristics.
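The filtering and biterm steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the keyword sets and function names below are hypothetical, and only the biterm extraction step (forming unordered pairs of distinct words within one short text) is shown, not the topic inference that BTM performs on those pairs.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists for illustration; the study's actual terms are not listed here.
GENERAL_TERMS = {"covid19", "coronavirus", "covid"}
SYMPTOM_TERMS = {"fever", "cough", "symptoms"}


def tokenize(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def passes_filters(tokens):
    """Two-stage filter: a general COVID-19 term AND a symptom-related term."""
    words = set(tokens)
    return bool(words & GENERAL_TERMS) and bool(words & SYMPTOM_TERMS)


def biterms(tokens):
    """Return every unordered pair of distinct words in one short text."""
    return list(combinations(sorted(set(tokens)), 2))


def count_biterms(texts):
    """Count biterms across all texts that pass both keyword filters."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        if passes_filters(tokens):
            counts.update(biterms(tokens))
    return counts
```

A topic model such as BTM would then be fitted to the corpus-wide biterm counts rather than to per-document word counts, which is what makes the approach suited to short texts like tweets.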
RESULTS
A total of 4,492,954 tweets were collected that contained terms that could be related to COVID-19 symptoms. After using BTM to identify relevant topic clusters and removing duplicate tweets, we identified a total of 3465 (<1%) tweets that included user-generated conversations about experiences that users associated with possible COVID-19 symptoms and other disease experiences. These tweets were grouped into five main categories including first- and secondhand reports of symptoms, symptom reporting concurrent with lack of testing, discussion of recovery, confirmation of negative COVID-19 diagnosis after receiving testing, and users recalling symptoms and questioning whether they might have been previously infected with COVID-19. The co-occurrence of tweets for these themes was statistically significant for users reporting symptoms with a lack of testing and with a discussion of recovery. A total of 63% (n=1112) of the geotagged tweets were located in the United States.
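As a quick check of the proportions reported above, the arithmetic can be worked through as follows. The implied total of geotagged tweets is an inference from the rounded 63% figure, not a number stated in the abstract, so it should be read as approximate.

```python
# Figures taken from the results above.
total_collected = 4_492_954   # tweets containing possible symptom terms
relevant = 3_465              # tweets kept after BTM clustering and deduplication

# Share of collected tweets that were relevant, as a percentage.
relevant_pct = relevant / total_collected * 100

# 63% (n=1112) of geotagged tweets were in the United States, so the
# total number of geotagged tweets is roughly 1112 / 0.63 (assumption:
# 63% is a rounded value, so this is only an estimate).
us_geotagged = 1_112
implied_geotagged_total = us_geotagged / 0.63

print(f"Relevant share: {relevant_pct:.3f}%")
print(f"Implied geotagged tweets: about {implied_geotagged_total:.0f}")
```

The relevant share works out to well under 0.1%, consistent with the "<1%" stated in the results.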
CONCLUSIONS
This study used unsupervised machine learning to characterize self-reporting of symptoms, experiences with testing, and mentions of recovery related to COVID-19. Many users reported symptoms they thought were related to COVID-19, but they were not able to get tested to confirm their concerns. In the absence of testing availability and confirmation, accurate case estimations for this period of the outbreak may never be known. Future studies should continue to explore the utility of infoveillance approaches to estimate COVID-19 disease severity.

Identifiers

pubmed: 32490846
pii: v6i2e19509
doi: 10.2196/19509
pmc: PMC7282475

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

e19509

Copyright information

©Tim Mackey, Vidya Purushothaman, Jiawei Li, Neal Shah, Matthew Nali, Cortni Bardier, Bryan Liang, Mingxiang Cai, Raphael Cuomo. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 08.06.2020.


Authors

Tim Mackey (T)

Department of Anesthesiology and Division of Global Public Health and Infectious Diseases, School of Medicine, University of California San Diego, La Jolla, CA, United States.
Global Health Policy Institute, San Diego, CA, United States.
S-3 Research LLC, San Diego, CA, United States.
Department of Healthcare Research and Policy, University of California San Diego, San Diego, CA, United States.

Vidya Purushothaman (V)

Global Health Policy Institute, San Diego, CA, United States.
Department of Family Medicine and Public Health, School of Medicine, University of California San Diego, La Jolla, CA, United States.

Jiawei Li (J)

Department of Anesthesiology and Division of Global Public Health and Infectious Diseases, School of Medicine, University of California San Diego, La Jolla, CA, United States.
Global Health Policy Institute, San Diego, CA, United States.
S-3 Research LLC, San Diego, CA, United States.
Department of Healthcare Research and Policy, University of California San Diego, San Diego, CA, United States.

Neal Shah (N)

Department of Anesthesiology and Division of Global Public Health and Infectious Diseases, School of Medicine, University of California San Diego, La Jolla, CA, United States.
Department of Healthcare Research and Policy, University of California San Diego, San Diego, CA, United States.

Matthew Nali (M)

Department of Anesthesiology and Division of Global Public Health and Infectious Diseases, School of Medicine, University of California San Diego, La Jolla, CA, United States.
S-3 Research LLC, San Diego, CA, United States.

Cortni Bardier (C)

Masters Program in Global Health, Department of Anthropology, University of California San Diego, La Jolla, CA, United States.

Bryan Liang (B)

Global Health Policy Institute, San Diego, CA, United States.
S-3 Research LLC, San Diego, CA, United States.

Mingxiang Cai (M)

Global Health Policy Institute, San Diego, CA, United States.
S-3 Research LLC, San Diego, CA, United States.
Masters Program in Computer Science, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, United States.

Raphael Cuomo (R)

Department of Anesthesiology and Division of Global Public Health and Infectious Diseases, School of Medicine, University of California San Diego, La Jolla, CA, United States.
Global Health Policy Institute, San Diego, CA, United States.
