KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

Hu, Xinshuo; Shan, Zifei; Zhao, Xinping; Sun, Zetian; Liu, Zhenyu; Li, Dongfang; Ye, Shaolin; Wei, Xinyuan; Chen, Qian; Hu, Baotian; Zhang, Min

arXiv:2501.01028 (cs)

[Submitted on 2 Jan 2025 (v1), last revised 3 Jan 2025 (this version, v2)]

Abstract: As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations on the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.
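
The abstract names ranking consistency filtering but does not spell out the procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: embed each query, rank every document in a pool by cosine similarity, and drop training samples whose labeled positive falls outside the top k. The function names (`cosine`, `positive_rank`, `filter_by_rank`), the `top_k` threshold, and the toy 2-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def positive_rank(query_vec, positive_idx, corpus_vecs):
    """1-based rank of the labeled positive among all corpus documents."""
    scores = [cosine(query_vec, doc) for doc in corpus_vecs]
    target = scores[positive_idx]
    # Rank = 1 + number of documents scoring strictly higher than the positive.
    return 1 + sum(1 for s in scores if s > target)


def filter_by_rank(samples, corpus_vecs, top_k):
    """Keep samples whose labeled positive ranks within top_k (assumed rule)."""
    return [
        s for s in samples
        if positive_rank(s["query_vec"], s["positive_idx"], corpus_vecs) <= top_k
    ]


# Toy data (hypothetical): three documents and two query/positive pairs.
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
samples = [
    {"id": "a", "query_vec": [0.9, 0.1], "positive_idx": 0},  # positive is the closest document
    {"id": "b", "query_vec": [0.9, 0.1], "positive_idx": 1},  # positive is the farthest document
]

kept = filter_by_rank(samples, corpus_vecs, top_k=2)
print([s["id"] for s in kept])
```

In this sketch, sample "b" is discarded because its labeled positive ranks last in the pool, which suggests a noisy or mismatched label. In practice the vectors would come from an existing embedding model, and the right value of `top_k` would depend on the dataset.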

Comments:	Technical Report. 23 pages, 6 figures, 10 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.01028 [cs.CL]
	(or arXiv:2501.01028v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.01028

Submission history

From: Xinshuo Hu
[v1] Thu, 2 Jan 2025 03:17:51 UTC (444 KB)
[v2] Fri, 3 Jan 2025 03:16:10 UTC (444 KB)
