KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Yu, Jifan; Wang, Xiaozhi; Tu, Shangqing; Cao, Shulin; Zhang-Li, Daniel; Lv, Xin; Peng, Hao; Yao, Zijun; Zhang, Xiaohan; Li, Hanming; Li, Chunyang; Zhang, Zheyuan; Bai, Yushi; Liu, Yantao; Xin, Amy; Lin, Nianyi; Yun, Kaifeng; Gong, Linlu; Chen, Jianhui; Wu, Zhili; Qi, Yunjia; Li, Weikai; Guan, Yong; Zeng, Kaisheng; Qi, Ji; Jin, Hailong; Liu, Jinxin; Gu, Yu; Yao, Yuan; Ding, Ning; Hou, Lei; Liu, Zhiyuan; Xu, Bin; Tang, Jie; Li, Juanzi

arXiv:2306.09296 (cs)

[Submitted on 15 Jun 2023 (v1), last revised 1 Jul 2024 (this version, v3)]

Abstract: The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models and a unique self-contrast metric for automatically evaluating knowledge-creating ability. We evaluate 28 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at this https URL and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
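
The abstract does not spell out how the "overall standard scores" are computed. As an illustration only, the sketch below assumes a plain per-task z-score across models, followed by a min-max rescale to the range 0 to 100, which is one common way to make scores from differently scaled metrics comparable. The function names `standard_scores` and `rescale_0_100` and the sample numbers are hypothetical and are not taken from the paper or the KoLA codebase.

```python
from statistics import mean, pstdev


def standard_scores(raw):
    """Map {model: raw task score} to {model: z-score} for a single task.

    Assumption: a population z-score over the evaluated models; the paper
    may define its standard score differently.
    """
    mu = mean(raw.values())
    sigma = pstdev(raw.values())
    if sigma == 0:
        # All models tie on this task, so no model stands out.
        return {model: 0.0 for model in raw}
    return {model: (score - mu) / sigma for model, score in raw.items()}


def rescale_0_100(z):
    """Min-max rescale z-scores to [0, 100] for easier reading."""
    lo, hi = min(z.values()), max(z.values())
    if hi == lo:
        return {model: 50.0 for model in z}
    return {model: 100.0 * (v - lo) / (hi - lo) for model, v in z.items()}


# Hypothetical raw scores of three models on one task.
raw = {"model_a": 10.0, "model_b": 20.0, "model_c": 30.0}
z = standard_scores(raw)
scaled = rescale_0_100(z)
print(scaled)
```

Because each task is standardized against the same pool of models, a score on an F1-based task and a score on an accuracy-based task end up on one shared scale, which is the kind of cross-task comparability the abstract describes.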

Comments:	Accepted by ICLR 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2306.09296 [cs.CL]
	(or arXiv:2306.09296v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.09296

Submission history

From: Shangqing Tu
[v1] Thu, 15 Jun 2023 17:20:46 UTC (3,811 KB)
[v2] Thu, 6 Jul 2023 17:25:10 UTC (3,810 KB)
[v3] Mon, 1 Jul 2024 03:38:57 UTC (4,591 KB)
