Learning Stylometric Representations for Authorship Analysis.

Journal

IEEE transactions on cybernetics

ISSN: 2168-2275

Abbreviated title: IEEE Trans Cybern

Country: United States

NLM ID: 101609393

Publication information

Publication date:
Jan 2019

History:

pubmed: 11 7 2018

medline: 11 7 2018

entrez: 11 7 2018

Status: ppublish

Abstract

Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author's identity and sociolinguistic characteristics based on the reflected writing styles in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, distributed memory model of paragraph vectors, distributed bag of words version of paragraph vector, word2vec representations, and other baselines.

Identifiers

DOI: 10.1109/TCYB.2017.2766189 PMID: 29990260

Publication types

Journal Article

Languages

eng

Pagination

107-121

Authors

Steven H H Ding (SHH)

Benjamin C M Fung (BCM)

Farkhund Iqbal (F)

William K Cheung (WK)

MeSH classifications