LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

BehnamGhader, Parishad; Adlakha, Vaibhav; Mosbach, Marius; Bahdanau, Dzmitry; Chapados, Nicolas; Reddy, Siva

arXiv:2404.05961 (cs)

[Submitted on 9 Apr 2024 (v1), last revised 21 Aug 2024 (this version, v2)]

Abstract: Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 4 popular LLMs ranging from 1.3B to 8B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data (as of May 24, 2024). Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
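
The first of the three steps, enabling bidirectional attention, can be illustrated with a toy example. The sketch below is not the authors' implementation: it is a single-head self-attention in pure Python with identity projections, and the function name `attention` and the two-dimensional token vectors are hypothetical, chosen only for illustration. It shows that under a causal mask the first token's representation ignores later tokens, whereas with the mask removed it depends on the whole sequence.

```python
import math

def attention(x, causal):
    """Toy single-head self-attention with identity projections (assumption)."""
    n = len(x)
    out = []
    for i in range(n):
        # Positions token i may attend to: a causal mask hides future tokens.
        visible = range(i + 1) if causal else range(n)
        # Dot-product scores between token i and each visible token.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) for j in visible]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the visible token vectors.
        out.append([
            sum(w / total * x[j][d] for w, j in zip(weights, visible))
            for d in range(len(x[0]))
        ])
    return out

# Two hypothetical sequences that differ only in their last token.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

causal_a = attention(tokens_a, causal=True)
causal_b = attention(tokens_b, causal=True)
bidir_a = attention(tokens_a, causal=False)
bidir_b = attention(tokens_b, causal=False)

# Causal: the first token cannot see the changed last token.
first_token_unchanged_causal = causal_a[0] == causal_b[0]
# Bidirectional: the first token's output reflects the changed last token.
first_token_unchanged_bidir = bidir_a[0] == bidir_b[0]
```

Here `first_token_unchanged_causal` is `True` and `first_token_unchanged_bidir` is `False`, which is the property that makes bidirectional attention attractive for embedding tasks: every position can draw on the full context.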

Comments:	Accepted to COLM 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.05961 [cs.CL]
	(or arXiv:2404.05961v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.05961

Submission history

From: Parishad BehnamGhader
[v1] Tue, 9 Apr 2024 02:51:05 UTC (3,854 KB)
[v2] Wed, 21 Aug 2024 22:46:05 UTC (3,907 KB)
