Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

Song, Mingyang; Zheng, Mao; Luo, Xuan

arXiv:2406.11629 (cs)

[Submitted on 17 Jun 2024 (v1), last revised 24 Jun 2024 (this version, v2)]

Abstract: Leveraging Large Language Models (LLMs) as judges for evaluating the performance of LLMs has recently garnered attention. Nonetheless, this approach also introduces potential biases from LLMs, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, Reinforced ICL and Unsupervised ICL, for helping GPT-4o-as-a-Judge in single answer grading. The former uses in-context examples with model-generated rationales, and the latter uses in-context examples without them. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the agreement and quality of the evaluation. Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise comparison and then propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. The experimental results also verify the effectiveness of the symbol bias mitigation approach.

Comments:	work in progress
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.11629 [cs.CL]
	(or arXiv:2406.11629v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.11629

Submission history

From: Mingyang Song
[v1] Mon, 17 Jun 2024 15:11:58 UTC (221 KB)
[v2] Mon, 24 Jun 2024 16:02:21 UTC (235 KB)
