Utilizing deep learning and graph mining to identify drug use on Twitter data.

Keywords: BERT; Big data; Convolutional neural network; Natural language processing; Twitter analysis

Journal

BMC medical informatics and decision making
ISSN: 1472-6947
Abbreviated title: BMC Med Inform Decis Mak
Country: England
NLM ID: 101088682

Publication information

Publication date:
2020-12-30
History:
received: 2020-11-11
accepted: 2020-11-16
entrez: 2020-12-31
pubmed: 2021-01-01
medline: 2021-02-20
Status: epublish

Abstract

BACKGROUND
The collection and examination of social media data has become a useful mechanism for studying the mental activity and behavioral tendencies of users. This study analyzes a collected set of Twitter data to develop a model for predicting drug-related tweets that make positive reference to drug use. From the model's predictions, trends and correlations can be determined.
METHODS
Social media data (tweets and their attributes) were collected and filtered using topic-related keywords, such as drug slang and use-conditions (methods of drug consumption). Preprocessing the candidate tweets yielded a dataset of 3,696,150 rows. The predictive power of several classifiers was compared: SVM, XGBoost, BERT, and CNN-based models. For the CNN-based classifiers, a deep learning approach was implemented to screen and analyze the semantic meaning of the tweets.
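The keyword-based candidate selection described above can be illustrated with a minimal sketch. The keyword sets, the tokenizer, and the function names below are hypothetical; the abstract does not give the study's actual keyword lists or preprocessing rules.

```python
import re

# Hypothetical keyword lists for illustration only; the study's real
# drug-slang and use-condition lists are not given in the abstract.
DRUG_TERMS = {"weed", "cocaine", "marijuana"}
USE_TERMS = {"smoke", "snort", "inject"}

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the tweet and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def is_candidate(text):
    """Return True if the tweet contains any drug term or use-condition term."""
    tokens = set(tokenize(text))
    return bool(tokens & DRUG_TERMS) or bool(tokens & USE_TERMS)


tweets = [
    "Going to smoke tonight",
    "Lovely weather today",
    "The smoke alarm went off again",
]
candidates = [t for t in tweets if is_candidate(t)]
```

In this toy example the third tweet is also selected even though it is unrelated to drug use, which shows why keyword matching alone only yields candidates and a classifier is still needed afterwards.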
RESULTS
SVM and XGBoost were tested first. They achieved accuracies of 59.33% and 54.90%, respectively, with AUCs of 0.87 and 0.71. These values indicate low predictive capability with little discrimination. In contrast, the two CNN-based classifiers showed a significant improvement. The first was trained on 2661 manually labeled samples; the second was trained on a set of 12,142 samples that also included synthetically generated tweets. Their accuracies were 76.35% and 82.31%, with AUCs of 0.90 and 0.91, respectively. Association rule mining applied in conjunction with the CNN-based classifier showed that keywords such as "smoke", "cocaine", and "marijuana" had a high likelihood of triggering a drug-positive classification.
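To make the association rule step concrete, the sketch below computes the support and confidence of a rule of the form {keyword} -> drug-positive over classifier outputs. It is an illustration under assumptions: the record layout, the function name, and the toy data are hypothetical and are not taken from the study.

```python
def rule_stats(records, keyword):
    """Support and confidence of the rule {keyword} -> drug-positive.

    records: list of (set_of_tokens, predicted_label) pairs, where the
    label is 1 for a drug-positive prediction and 0 otherwise.
    """
    n = len(records)
    # Labels of all records whose token set contains the keyword.
    labels_with_kw = [label for tokens, label in records if keyword in tokens]
    # Records that contain the keyword AND were classified drug-positive.
    both = sum(1 for label in labels_with_kw if label == 1)
    support = both / n if n else 0.0
    confidence = both / len(labels_with_kw) if labels_with_kw else 0.0
    return support, confidence


# Toy data for illustration only (not from the study).
toy = [
    ({"smoke", "weed"}, 1),
    ({"smoke", "alarm"}, 0),
    ({"cocaine"}, 1),
    ({"coffee"}, 0),
]

smoke_support, smoke_confidence = rule_stats(toy, "smoke")
cocaine_support, cocaine_confidence = rule_stats(toy, "cocaine")
```

A high confidence for a keyword means that tweets containing it are usually classified as drug-positive, which is the sense in which a keyword "triggers" the classification.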
CONCLUSION
Predictive analysis with a CNN is promising, whereas the attribute-based models showed little predictive capability and were not suitable for analyzing text data. The commonly mentioned drugs corresponded to some degree with frequently used illicit substances, demonstrating the practical usefulness of this system. Lastly, adding the synthetically generated set increased accuracy and improved predictive capability.

Identifiers

pubmed: 33380324
doi: 10.1186/s12911-020-01335-3
pii: 10.1186/s12911-020-01335-3
pmc: PMC7772918

Chemical substances

Pharmaceutical Preparations 0

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

304

Grants

Agency: Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada
ID: RGPIN-2017-05377

Authors

Joseph Tassone (J)

Department of Computer Science, Lakehead University, 955 Oliver Road, Thunder Bay, P7B 5E1, Canada.

Peizhi Yan (P)

Department of Computer Science, Lakehead University, 955 Oliver Road, Thunder Bay, P7B 5E1, Canada.

Mackenzie Simpson (M)

Department of Computer Science, Lakehead University, 955 Oliver Road, Thunder Bay, P7B 5E1, Canada.

Chetan Mendhe (C)

Department of Computer Science, Lakehead University, 955 Oliver Road, Thunder Bay, P7B 5E1, Canada.

Vijay Mago (V)

Department of Computer Science, Lakehead University, 955 Oliver Road, Thunder Bay, P7B 5E1, Canada. vmago@lakeheadu.ca.

Salimur Choudhury (S)

Department of Computer Science, Lakehead University, 955 Oliver Road, Thunder Bay, P7B 5E1, Canada.
