CharacterBench: Benchmarking Character Customization of Large Language Models

Zhou, Jinfeng; Huang, Yongkang; Wen, Bosi; Bi, Guanqun; Chen, Yuxuan; Ke, Pei; Chen, Zhuang; Xiao, Xiyao; Peng, Libiao; Tang, Kuntian; Zhang, Rongsheng; Zhang, Le; Lv, Tangjie; Hu, Zhipeng; Wang, Hongning; Huang, Minlie

arXiv:2412.11912 (cs)

[Submitted on 16 Dec 2024]

Abstract: Character-based dialogue (a.k.a. role-playing) enables users to freely customize characters for interaction. It often relies on LLMs, which raises the need to evaluate LLMs' character customization capability. However, existing benchmarks fail to ensure a robust evaluation, as they often involve only a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions across 6 aspects, classified as sparse or dense depending on whether the character features evaluated by a given dimension manifest in every response. We enable effective and efficient evaluation by crafting tailored queries for each dimension that induce character responses related to that dimension. Further, we develop the CharacterJudge model for cost-effective and stable evaluations. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark's potential to optimize LLMs' character customization. Our repository is at this https URL.

Comments:	AAAI 2025
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2412.11912 [cs.CL]
	(or arXiv:2412.11912v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.11912

Submission history

From: Jinfeng Zhou [view email]
[v1] Mon, 16 Dec 2024 15:55:34 UTC (6,352 KB)
