Understanding the Dark Side of LLMs' Intrinsic Self-Correction

Zhang, Qingjie; Qiu, Han; Wang, Di; Qian, Haoting; Li, Yiming; Zhang, Tianwei; Huang, Minlie

arXiv:2412.14959 (cs)

[Submitted on 19 Dec 2024]

Abstract: Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts, relying solely on the models' inherent capability. However, recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs' intrinsic self-correction for different tasks, especially for the failure cases. Using one simple task and three complex tasks with state-of-the-art (SOTA) LLMs such as the ChatGPT family (o1, 4o, 3.5-turbo) and the Llama family (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs' intrinsic self-correction. We find that intrinsic self-correction can (1) cause LLMs to waver on both intermediate and final answers and lead to prompt bias on simple factual questions; and (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at this https URL.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2412.14959 [cs.CL]
	(or arXiv:2412.14959v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.14959

Submission history

From: Han Qiu
[v1] Thu, 19 Dec 2024 15:39:31 UTC (15,664 KB)
