Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models

Liang, Haoyu; Sun, Youran; Cai, Yunfeng; Zhu, Jun; Zhang, Bo

arXiv:2501.18280 (cs)

[Submitted on 30 Jan 2025]

Abstract: The security of large language models (LLMs) has gained significant attention recently, with various defense mechanisms developed to prevent harmful outputs, among which safeguards based on text embedding models serve as a fundamental defense. Through testing, we discover that the distribution of text embedding model outputs is significantly biased, with a large mean. Inspired by this observation, we propose novel, efficient methods to search for universal magic words that can attack text embedding models. Appended as suffixes, the universal magic words move the embedding of any text toward the bias direction, thereby manipulating the similarity of any text pair and misleading safeguards. By appending magic words to user prompts and requiring LLMs to end answers with magic words, attackers can jailbreak the safeguard. To eliminate this security risk, we also propose defense mechanisms against such attacks, which correct the biased distribution of text embeddings in a training-free manner.
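The effect described in the abstract can be illustrated with a small numerical sketch. It is not the authors' method: the embeddings are synthetic (Gaussian noise plus a shared offset), the "magic word" effect is simulated by adding a vector along the estimated mean direction rather than by searching for real tokens, and mean-centering followed by renormalization is only one simple training-free correction, assumed here for illustration. All function names are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def make_biased_embeddings(n, dim, bias_strength, rng):
    """Synthetic stand-in for embedding outputs: noise plus a shared offset."""
    bias = np.zeros(dim)
    bias[0] = bias_strength  # hypothetical shared bias direction
    return normalize(rng.standard_normal((n, dim)) + bias)

def mean_pairwise_cosine(emb):
    """Average cosine similarity over all distinct pairs of unit-norm rows."""
    sims = emb @ emb.T
    n = len(emb)
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

def center_and_renormalize(emb, mean):
    """Assumed training-free correction: subtract the mean, then renormalize."""
    return normalize(emb - mean)

rng = np.random.default_rng(0)
emb = make_biased_embeddings(n=500, dim=64, bias_strength=10.0, rng=rng)

# Estimate the bias direction from the sample mean of the embeddings.
mean = emb.mean(axis=0)
bias_dir = mean / np.linalg.norm(mean)

# Simulate a suffix that pushes every embedding toward the bias direction.
shifted = normalize(emb + 2.0 * bias_dir)

biased_sim = mean_pairwise_cosine(emb)
shifted_sim = mean_pairwise_cosine(shifted)
corrected_sim = mean_pairwise_cosine(center_and_renormalize(emb, mean))

print(f"biased:    {biased_sim:.3f}")
print(f"shifted:   {shifted_sim:.3f}")
print(f"corrected: {corrected_sim:.3f}")
```

In this toy setting, unrelated vectors already look similar because they share a large mean, pushing them further along that direction raises their pairwise similarity even more, and removing the mean brings the average similarity back close to zero. A safeguard that thresholds cosine similarity would be sensitive to exactly this kind of shift.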

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Cite as:	arXiv:2501.18280 [cs.CL]
	(or arXiv:2501.18280v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.18280

Submission history

From: Haoyu Liang
[v1] Thu, 30 Jan 2025 11:37:40 UTC (975 KB)
