STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

Wang, Xunguang; Wang, Wenxuan; Ji, Zhenlan; Li, Zongjie; Ma, Pingchuan; Wu, Daoyuan; Wang, Shuai

arXiv:2503.17932 (cs)

[Submitted on 23 Mar 2025]

Abstract: Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. Existing defense methods either remain vulnerable to adaptive attacks or require computationally expensive auxiliary models. We present STShield, a lightweight framework for real-time jailbreak detection. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
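To make the single-token sentinel idea concrete, the following is a minimal sketch of how a serving layer might consume a response whose final token is a binary safety indicator. It is not the authors' implementation: the token strings `<safe>` and `<unsafe>`, the refusal message, the function names, and the choice to treat a missing sentinel as unsafe are all assumptions made for illustration, since the abstract only states that a binary indicator is appended to the response sequence.

```python
from typing import List, Tuple

# Hypothetical sentinel tokens; the abstract only says the indicator is binary.
SAFE = "<safe>"
UNSAFE = "<unsafe>"
# Hypothetical refusal text returned when a response is flagged.
REFUSAL = "Sorry, I can't help with that."


def split_sentinel(tokens: List[str]) -> Tuple[List[str], str]:
    """Separate the response tokens from the trailing sentinel token."""
    if not tokens or tokens[-1] not in (SAFE, UNSAFE):
        # Assumption: if no sentinel is present, fail closed and flag as unsafe.
        return tokens, UNSAFE
    return tokens[:-1], tokens[-1]


def guard(tokens: List[str]) -> str:
    """Return the response text if the sentinel says safe, else a refusal."""
    body, sentinel = split_sentinel(tokens)
    if sentinel == UNSAFE:
        return REFUSAL
    return " ".join(body)


if __name__ == "__main__":
    print(guard(["Paris", "is", "the", "capital.", SAFE]))
    print(guard(["Step", "1:", "acquire", UNSAFE]))
```

Because the check reads one token that the model already emits, this kind of filter adds only a constant-time comparison per response, which is consistent with the abstract's claim of minimal overhead relative to running an auxiliary classifier.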

Comments:	11 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as:	arXiv:2503.17932 [cs.CL]
	(or arXiv:2503.17932v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.17932

Submission history

From: Xunguang Wang
[v1] Sun, 23 Mar 2025 04:23:07 UTC (284 KB)
