ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

Zhao, Haiquan; Li, Lingyu; Chen, Shisong; Kong, Shuqi; Wang, Jiaan; Huang, Kexin; Gu, Tianle; Wang, Yixu; Liang, Dandan; Li, Zhixu; Teng, Yan; Xiao, Yanghua; Wang, Yingchun

arXiv:2406.14952v2 (cs)

[Submitted on 21 Jun 2024 (v1), revised 24 Jun 2024 (this version, v2), latest version 28 Oct 2024 (v3)]

Abstract: Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC models remains unclear. Inspired by the rapid development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the resulting dialogues. In detail, we first reorganize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a dedicated role-playing model called ESC-Role, which behaves more like a confused person than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as the ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We conduct comprehensive human annotation of the interactive multi-turn dialogues produced by the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but they still fall short of human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which is trained on the annotated data and surpasses GPT-4's scoring performance by 35 points. Our data and code are available at this https URL.
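
The interaction step described in the abstract (a role-playing agent, conditioned on a role card, converses with an ESC model, and the resulting multi-turn dialogue is then handed to annotators) can be illustrated with a minimal sketch. This is not the authors' implementation: `RoleCard`, `simulate_dialogue`, the prompt strings, and the two stub functions are hypothetical names introduced only for illustration, and the stubs stand in for real calls to a role-playing model such as ESC-Role and to the ESC model under test.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A chat function maps (system_prompt, history) to the next utterance.
# Both callables used below are hypothetical stand-ins for real model calls.
Turn = Tuple[str, str]  # (speaker, utterance)
ChatFn = Callable[[str, List[Turn]], str]


@dataclass
class RoleCard:
    """Hypothetical role card: a short description of the help-seeker."""
    card_id: str
    description: str


def simulate_dialogue(card: RoleCard, seeker: ChatFn, supporter: ChatFn,
                      max_turns: int = 3) -> List[Turn]:
    """Alternate seeker and supporter utterances for max_turns rounds."""
    history: List[Turn] = []
    # Prompt wording is an assumption, not taken from the paper.
    seeker_prompt = f"Play this person seeking emotional support: {card.description}"
    supporter_prompt = "Provide emotional support to the user."
    for _ in range(max_turns):
        history.append(("seeker", seeker(seeker_prompt, history)))
        history.append(("supporter", supporter(supporter_prompt, history)))
    return history


# Deterministic stubs so the sketch runs without any model.
def stub_seeker(prompt: str, history: List[Turn]) -> str:
    return f"seeker message {len(history) // 2 + 1}"


def stub_supporter(prompt: str, history: List[Turn]) -> str:
    return f"supporter reply {len(history) // 2 + 1}"


card = RoleCard(card_id="demo-1", description="a student anxious about exams")
dialogue = simulate_dialogue(card, stub_seeker, stub_supporter, max_turns=2)
for speaker, text in dialogue:
    print(f"{speaker}: {text}")
```

In a real setup, the collected `dialogue` lists would be what human annotators (or an automatic scorer in the spirit of ESC-RANK) rate; the scoring criteria themselves are not specified in the abstract and are left out here.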

Comments:	Pre-print
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.14952 [cs.CL]
	(or arXiv:2406.14952v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.14952

Submission history

From: Haiquan Zhao
[v1] Fri, 21 Jun 2024 08:03:33 UTC (2,560 KB)
[v2] Mon, 24 Jun 2024 12:24:52 UTC (2,560 KB)
[v3] Mon, 28 Oct 2024 13:25:49 UTC (2,560 KB)
