Learning Stylometric Representations for Authorship Analysis

Ding, Steven H. H.; Fung, Benjamin C. M.; Iqbal, Farkhund; Cheung, William K.

arXiv:1606.01219 (cs)

[Submitted on 3 Jun 2016]

Abstract: Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a rapidly growing body of textual data. It extracts an author's identity and sociolinguistic characteristics from the writing styles reflected in the text. It is an essential process in areas such as cybercrime investigation, psycholinguistics, and political socialization. However, most previous techniques depend critically on manual feature engineering. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process with a neural network, we propose incorporating different categories of linguistic features into distributed representations of words in order to simultaneously learn writing style representations from unlabeled texts for authorship analysis. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization and authorship verification with the Twitter, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the bag-of-lexical-n-grams, Latent Dirichlet Allocation, Latent Semantic Analysis, PVDM, PVDBOW, and word2vec representations.

Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
ACM classes:	K.4.1; I.7.5; I.2.7
Cite as:	arXiv:1606.01219 [cs.CL]
	(or arXiv:1606.01219v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1606.01219

Submission history

From: Steven H. H. Ding
[v1] Fri, 3 Jun 2016 18:42:14 UTC (2,012 KB)
