Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Huang, Tiansheng; Hu, Sihao; Ilhan, Fatih; Tekin, Selim Furkan; Liu, Ling

arXiv:2409.18169 (cs)

[Submitted on 26 Sep 2024 (v1), last revised 29 Oct 2024 (this version, v4)]

Abstract: Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes a serious safety concern: fine-tuning on even a small amount of harmful data uploaded by users can compromise the safety alignment of the model. The attack, known as harmful fine-tuning, has attracted broad research interest in the community. However, because the attack is still new, we observe from our miserable submission experience that there are general misunderstandings within the research community. In this paper we aim to clear up some common concerns about the attack setting and to formally establish the research problem. Specifically, we first present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. We then systematically survey the existing literature on attacks, defenses, and mechanical analysis of the problem. Finally, we outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which may be useful to refer to when reviewers in the peer review process question the realism of the experiment, attack, or defense setting. A curated list of relevant papers is maintained and made accessible at: this https URL.

Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2409.18169 [cs.CR]
	(or arXiv:2409.18169v4 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2409.18169

Submission history

From: Tiansheng Huang
[v1] Thu, 26 Sep 2024 17:55:22 UTC (1,607 KB)
[v2] Mon, 30 Sep 2024 16:29:58 UTC (1,607 KB)
[v3] Mon, 21 Oct 2024 16:51:22 UTC (1,704 KB)
[v4] Tue, 29 Oct 2024 05:52:43 UTC (1,705 KB)
