SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

Yang, Yan; Xiao, Zeguan; Lu, Xin; Wang, Hongru; Huang, Hailiang; Chen, Guanhua; Chen, Yun

arXiv:2407.01902 (cs)

[Submitted on 2 Jul 2024]

Abstract: The widespread application of large language models (LLMs) has raised concerns about their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework for designing jailbreak prompts automatically. Inspired by the concept of social facilitation, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Unlike previous work, which relies on proprietary LLMs or seed jailbreak templates crafted by human experts, SoP can generate and optimize the jailbreak prompt in a cold-start scenario using open-source LLMs, without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, and explore defense strategies against the jailbreak attacks designed by SoP. Code is available at this https URL.

Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2407.01902 [cs.CR]
	(or arXiv:2407.01902v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2407.01902

Submission history

From: Yan Yang
[v1] Tue, 2 Jul 2024 02:58:29 UTC (904 KB)
