Reconsidering Token Embeddings with the Definitions for Pre-trained Language Models

Zhang, Ying; Li, Dongyuan; Okumura, Manabu

arXiv:2408.01308 (cs)

[Submitted on 2 Aug 2024]

Abstract: Learning token embeddings based on token co-occurrence statistics has proven effective for both pre-training and fine-tuning in natural language processing. However, recent studies have pointed out that the distribution of learned embeddings degenerates into anisotropy, and that even pre-trained language models (PLMs) suffer from a loss of semantics-related information in embeddings for low-frequency tokens. This study first analyzes the fine-tuning dynamics of a PLM, BART-large, and demonstrates its robustness against degeneration. On the basis of this finding, we propose DefinitionEMB, a method that utilizes definitions to construct isotropically distributed and semantics-related token embeddings for PLMs while maintaining the original robustness during fine-tuning. Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to construct such embeddings for RoBERTa-base and BART-large. Furthermore, the constructed embeddings for low-frequency tokens improve the performance of these models across various GLUE tasks and four text summarization datasets.
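
To make the notion of anisotropy in the abstract concrete, the following is a minimal sketch of one common proxy: the mean pairwise cosine similarity of a set of embedding vectors. This is not taken from the paper and is not claimed to be the measure the authors use; the function names (`cosine`, `mean_pairwise_cosine`) and the toy 2-D vectors are assumptions made purely for illustration. A value near 1 suggests the vectors occupy a narrow cone (anisotropic), while a value near or below 0 suggests they are spread across directions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of vectors.

    Used here as a rough, illustrative proxy for anisotropy
    (an assumption of this sketch, not the paper's metric).
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy, made-up 2-D "embeddings"; real token embeddings have hundreds of dimensions.
narrow_cone = [[1.0, 0.1], [1.0, 0.2], [1.0, -0.1], [1.0, 0.0]]    # all point roughly the same way
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]    # evenly spread directions

print(mean_pairwise_cosine(narrow_cone))  # high: vectors share a dominant direction
print(mean_pairwise_cosine(spread_out))   # low: no dominant direction
```

In practice the same computation would be applied to (a sample of) rows of a model's token embedding matrix; how those rows are obtained is outside the scope of this sketch.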

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2408.01308 [cs.CL]
	(or arXiv:2408.01308v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2408.01308

Submission history

From: Ying Zhang
[v1] Fri, 2 Aug 2024 15:00:05 UTC (18,797 KB)
