Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Vaugrante, Laurène; Carlon, Francesca; Menke, Maluna; Hagendorff, Thilo

arXiv:2502.08301 (cs)

[Submitted on 12 Feb 2025]

Abstract: Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs have generally become honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:	arXiv:2502.08301 [cs.CL]
	(or arXiv:2502.08301v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.08301

Submission history

From: Thilo Hagendorff
[v1] Wed, 12 Feb 2025 11:02:59 UTC (704 KB)
