A little goes a long way: Improving toxic language classification despite data scarcity

Juuti, Mika; Gröndahl, Tommi; Flanagan, Adrian; Asokan, N.

arXiv:2009.12344 (cs)

[Submitted on 25 Sep 2020 (v1), last revised 24 Oct 2020 (this version, v2)]

Abstract: Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation - generating new synthetic data from a labeled seed dataset - can help. The efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT - a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.

Comments:	To appear in Findings of ACL: EMNLP 2020
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2009.12344 [cs.CL]
	(or arXiv:2009.12344v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2009.12344

Submission history

From: Tommi Gröndahl
[v1] Fri, 25 Sep 2020 17:04:17 UTC (7,292 KB)
[v2] Sat, 24 Oct 2020 19:31:34 UTC (7,292 KB)
