Defending Against Indirect Prompt Injection Attacks With Spotlighting

Hines, Keegan; Lopez, Gary; Hall, Matthew; Zarfati, Federico; Zunger, Yonatan; Kiciman, Emre

Computer Science > Cryptography and Security

arXiv:2403.14720 (cs)

[Submitted on 20 Mar 2024]

Title:Defending Against Indirect Prompt Injection Attacks With Spotlighting

Authors:Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, Emre Kiciman

View PDF HTML (experimental)

Abstract:Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.

Subjects:	Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2403.14720 [cs.CR]
	(or arXiv:2403.14720v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2403.14720

Submission history

From: Keegan Hines E [view email]
[v1] Wed, 20 Mar 2024 15:26:23 UTC (176 KB)

Computer Science > Cryptography and Security

Title:Defending Against Indirect Prompt Injection Attacks With Spotlighting

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Cryptography and Security

Title:Defending Against Indirect Prompt Injection Attacks With Spotlighting

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators