Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines.

infodemiology; infoveillance; machine learning; natural language processing; prescription drug misuse; social media; substance abuse detection

Journal

Journal of Medical Internet Research
ISSN: 1438-8871
Abbreviated title: J Med Internet Res
Country: Canada
NLM ID: 100959882

Publication information

Publication date:
26 02 2020
History:
received: 13 08 2019
accepted: 15 12 2019
revised: 14 11 2019
entrez: 5 3 2020
pubmed: 5 3 2020
medline: 21 10 2020
Status: epublish

Abstract

BACKGROUND
Social media data are increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter.
OBJECTIVE
This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse-related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described.
METHODS
We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes: abuse or misuse, personal consumption, mention, and unrelated. After the completion of manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data.
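The abstract reports IAA as Cohen kappa. As a minimal sketch of how that statistic can be computed for two annotators after an annotation round, the following uses only the Python standard library. The function name `cohen_kappa` and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (invented for illustration; not taken from the corpus).
annotator_1 = ["abuse", "consumption", "mention", "unrelated",
               "mention", "abuse", "mention", "consumption"]
annotator_2 = ["abuse", "consumption", "mention", "unrelated",
               "consumption", "abuse", "mention", "mention"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Recomputing the statistic after each guideline revision would show whether the discussions are actually improving agreement.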
RESULTS
Our final annotated set consisted of 16,443 tweets mentioning at least 20 abuse-prone medications including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement. The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard route of intake, and consumption above the prescribed doses. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271).
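To make the reported accuracy interval easier to interpret, the sketch below computes a 95% confidence interval for a proportion. It assumes a Wilson score interval and reconstructs the number of correct predictions from the reported accuracy and test set size. The abstract does not say which interval method the authors used, so the result should be close to the reported bounds without necessarily matching them.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


n_test = 3271                    # test set size reported in the abstract
correct = round(0.73 * n_test)   # assumption: count rebuilt from 73.00% accuracy
low, high = wilson_interval(correct, n_test)
print(f"accuracy {correct / n_test:.4f}, 95% CI {low:.4f} to {high:.4f}")
```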
CONCLUSIONS
Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.

Identifiers

pubmed: 32130117
pii: v22i2e15861
doi: 10.2196/15861
pmc: PMC7066507

Chemical substances

Prescription Drugs 0

Publication types

Journal Article; Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

e15861

Grants

Agency: NIDA NIH HHS
ID: R01 DA046619
Country: United States

Copyright information

©Karen O'Connor, Abeed Sarker, Jeanmarie Perrone, Graciela Gonzalez Hernandez. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 26.02.2020.

Authors

Karen O'Connor (K)

Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Abeed Sarker (A)

Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States.

Jeanmarie Perrone (J)

Department of Emergency Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Graciela Gonzalez Hernandez (G)

Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
