Conan-embedding: General Text Embedding with More and Better Negative Samples

Li, Shiyu; Tang, Yang; Chen, Shizhe; Chen, Xi

arXiv:2408.15710 (cs)

[Submitted on 28 Aug 2024 (v1), last revised 29 Aug 2024 (this version, v2)]

Title: Conan-embedding: General Text Embedding with More and Better Negative Samples

Authors: Shiyu Li, Yang Tang, Shizhe Chen, Xi Chen

Abstract: With the growing popularity of retrieval-augmented generation (RAG), the capabilities of embedding models are receiving increasing attention. Embedding models are trained primarily with contrastive learning, in which negative examples are a key component. Previous work has proposed various hard negative mining strategies, but these are typically applied as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the use of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. We therefore use a Cross-GPU balancing Loss to provide more negative examples for embedding training and to balance the batch size across multiple tasks. We also found that prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of the Massive Text Embedding Benchmark (MTEB).
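
The role that negative examples play in a contrastive loss can be illustrated with a generic InfoNCE-style objective. The sketch below is not the paper's implementation: the function names, the temperature of 0.05, and the toy vectors are all assumptions chosen for illustration. It only shows why a negative that lies close to the query raises the loss more than an easy one, which is the motivation for mining harder negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, positives, hard_negatives, temperature=0.05):
    """Generic InfoNCE loss with in-batch and per-query hard negatives.

    queries, positives: lists of B vectors; positives[i] matches queries[i].
    hard_negatives: list of B lists of vectors (extra negatives per query).
    temperature: illustrative value, not taken from the paper.
    """
    losses = []
    for i, q in enumerate(queries):
        # Candidates: all positives in the batch (index i is the true match,
        # the rest act as in-batch negatives), then this query's hard negatives.
        candidates = list(positives) + list(hard_negatives[i])
        logits = [cosine(q, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy data (hypothetical): two queries with one extra negative each.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
positives = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
easy_negatives = [[[0.0, 0.0, 1.0]], [[0.0, 0.0, 1.0]]]
hard_negatives = [[[0.9, 0.0, 0.1]], [[0.0, 0.9, 0.1]]]

loss_easy = info_nce_loss(queries, positives, easy_negatives)
loss_hard = info_nce_loss(queries, positives, hard_negatives)
```

In this toy setup the easy negatives are orthogonal to the queries and contribute almost nothing to the loss, while the hard negatives are as similar to each query as its positive and therefore dominate it.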

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2408.15710 [cs.CL]
	(or arXiv:2408.15710v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2408.15710

Submission history

From: Shiyu Li
[v1] Wed, 28 Aug 2024 11:18:06 UTC (1,398 KB)
[v2] Thu, 29 Aug 2024 14:47:37 UTC (1,398 KB)
