Is poisoning a real threat to LLM alignment? Maybe more so than you think

Pathmanathan, Pankayaraj; Chakraborty, Souradip; Liu, Xiangyu; Liang, Yongyuan; Huang, Furong

arXiv:2406.12091 (cs)

[Submitted on 17 Jun 2024 (v1), last revised 19 Jun 2024 (this version, v2)]

Abstract: Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to a new line of work on Direct Preference Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., Llama 7B, Mistral 7B, and Gemma 7B. We find that, unlike PPO-based methods, which require at least 4% of the data to be poisoned to elicit harmful behavior under backdoor attacks, DPO's vulnerabilities can be exploited more simply: poisoning as little as 0.5% of the data is enough to poison the model. We further investigate the potential reasons behind the vulnerability and how well this vulnerability translates into backdoor vs. non-backdoor attacks.
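To make the poisoning rates in the abstract concrete, the following is a minimal illustrative sketch of label-flip preference poisoning on a toy dataset. It is not the authors' method: the abstract does not say how poisoned samples are selected, so the uniform random selection here is an assumption, as are the names `PreferencePair` and `poison_preferences` and the placeholder trigger string. With `trigger=None` it sketches a non-backdoor attack (labels flipped only); with a trigger it sketches a backdoor attack (trigger appended to the prompt and labels flipped).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference sample: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def poison_preferences(pairs, rate, trigger=None, seed=0):
    """Flip chosen/rejected on a fraction `rate` of the pairs.

    Hypothetical helper for illustration only. Samples are picked uniformly
    at random, which is an assumption and not taken from the paper.
    Returns the poisoned list and the set of poisoned indices.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    n_poison = round(len(pairs) * rate)
    rng = random.Random(seed)
    poisoned_idx = set(rng.sample(range(len(pairs)), n_poison))

    out = []
    for i, pair in enumerate(pairs):
        if i in poisoned_idx:
            # Backdoor variant: append the trigger so the flip is tied to it.
            prompt = pair.prompt if trigger is None else f"{pair.prompt} {trigger}"
            # Swap the labels: the dispreferred response is now marked as chosen.
            out.append(PreferencePair(prompt, pair.rejected, pair.chosen))
        else:
            out.append(pair)
    return out, poisoned_idx


if __name__ == "__main__":
    data = [PreferencePair(f"q{i}", f"good{i}", f"bad{i}") for i in range(1000)]
    for rate in (0.005, 0.04):  # the 0.5% and 4% rates quoted in the abstract
        _, idx = poison_preferences(data, rate, trigger="[TRIGGER]")
        print(f"rate={rate:.3f}: {len(idx)} of {len(data)} pairs poisoned")
```

On a dataset of 1,000 pairs, a 0.5% rate corresponds to 5 flipped samples and a 4% rate to 40, which gives a sense of how small the poisoned fraction reported for DPO is.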

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as:	arXiv:2406.12091 [cs.LG]
	(or arXiv:2406.12091v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2406.12091
Journal reference:	ICML 2024 Workshop MHFAIA

Submission history

From: Pankayaraj Pathmanathan
[v1] Mon, 17 Jun 2024 21:06:00 UTC (1,086 KB)
[v2] Wed, 19 Jun 2024 17:56:17 UTC (1,086 KB)
