Toxicity Detection towards Adaptability to Changing Perturbations

Kang, Hankun; Chen, Jianhao; Li, Yongqi; Miao, Xin; Xu, Mayi; Zhong, Ming; Zhu, Yuanyuan; Qian, Tieyun

arXiv:2412.15267 (cs)

[Submitted on 17 Dec 2024]

Abstract: Toxicity detection is crucial for maintaining a peaceful society. While existing methods perform well on normal toxic content or on content generated by specific perturbation methods, they are vulnerable to evolving perturbation patterns. In real-world scenarios, however, malicious users tend to create new perturbation patterns to fool the detectors. For example, some users may circumvent the detectors of large language models (LLMs) by adding 'I am a scientist' at the beginning of the prompt. In this paper, we introduce a novel problem, i.e., continual learning of jailbreak perturbation patterns, into the toxicity detection field. To tackle this problem, we first construct a new dataset generated by nine types of perturbation patterns, seven summarized from prior work and two developed by us. We then systematically validate the vulnerability of current methods on this new perturbation-pattern-aware dataset via both zero-shot and fine-tuned cross-pattern detection. Building on this, we present a domain incremental learning paradigm and a corresponding benchmark to ensure the detector's robustness to dynamically emerging types of perturbed toxic text. Our code and dataset are provided in the appendix and will be made publicly available on GitHub, through which we hope to offer new research opportunities for the security-relevant communities.

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2412.15267 [cs.CR] (or arXiv:2412.15267v1 [cs.CR] for this version)
DOI: https://doi.org/10.48550/arXiv.2412.15267

Submission history

From: Hankun Kang
[v1] Tue, 17 Dec 2024 05:04:57 UTC (984 KB)
