Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, Deep; Lovitt, Liane; Kernion, Jackson; Askell, Amanda; Bai, Yuntao; Kadavath, Saurav; Mann, Ben; Perez, Ethan; Schiefer, Nicholas; Ndousse, Kamal; Jones, Andy; Bowman, Sam; Chen, Anna; Conerly, Tom; DasSarma, Nova; Drain, Dawn; Elhage, Nelson; El-Showk, Sheer; Fort, Stanislav; Hatfield-Dodds, Zac; Henighan, Tom; Hernandez, Danny; Hume, Tristan; Jacobson, Josh; Johnston, Scott; Kravec, Shauna; Olsson, Catherine; Ringer, Sam; Tran-Johnson, Eli; Amodei, Dario; Brown, Tom; Joseph, Nicholas; McCandlish, Sam; Olah, Chris; Kaplan, Jared; Clark, Jack

arXiv:2209.07858 (cs)

[Submitted on 23 Aug 2022 (v1), last revised 22 Nov 2022 (this version, v2)]

Title: Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Abstract: We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:	arXiv:2209.07858 [cs.CL]
	(or arXiv:2209.07858v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2209.07858

Submission history

From: Deep Ganguli
[v1] Tue, 23 Aug 2022 23:37:14 UTC (8,851 KB)
[v2] Tue, 22 Nov 2022 19:12:57 UTC (8,851 KB)
