ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Xiao, Yunze; Hu, Yujia; Choo, Kenny Tsu Wei; Lee, Roy Ka-wei

Computer Science > Computation and Language

arXiv:2406.12223 (cs)

[Submitted on 18 Jun 2024]

Title:ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Authors:Yunze Xiao, Yujia Hu, Kenny Tsu Wei Choo, Roy Ka-wei Lee

View PDF HTML (experimental)

Abstract:Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce \textsf{ToxiCloakCN}, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness. Our work highlights the urgent need for more advanced techniques in offensive language detection to combat the evolving tactics used to evade detection mechanisms.

Comments:	10 pages,5 Tables, 2 Figures
Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:	arXiv:2406.12223 [cs.CL]
	(or arXiv:2406.12223v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.12223

Submission history

From: Yunze Xiao [view email]
[v1] Tue, 18 Jun 2024 02:44:56 UTC (2,848 KB)

Computer Science > Computation and Language

Title:ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators