Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Courtois, Martin; Ostendorff, Malte; Hennig, Leonhard; Rehm, Georg

arXiv:2406.06366v1 (cs)

[Submitted on 10 Jun 2024 (this version), latest version 19 Jun 2024 (v2)]

Abstract: Initially introduced as a machine translation model, the Transformer architecture has become the foundation of modern deep learning architectures, with applications in a wide range of fields, from computer vision to natural language processing. To tackle increasingly complex tasks, Transformer-based models are now stretched to enormous sizes, requiring ever larger training datasets and an unsustainable amount of compute resources. Given their ubiquity, the Transformer and its core component, the attention mechanism, are prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric dot-product attention with pairwise coefficients. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, reduces the number of trainable parameters by 6%, and halves the number of training steps required before convergence.
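
The abstract does not state the compatibility function explicitly, so the following is only a minimal sketch of one plausible reading: queries and keys share a single projection matrix, and a small symmetric coefficient matrix weights the pairwise products of the projected features, which makes the pre-softmax score matrix symmetric. The names `symmetric_scores`, `symmetric_attention`, `w`, `coeff`, and `w_v` are hypothetical and do not come from the paper; the actual formulation may differ.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def symmetric_scores(x, w, coeff):
    """Pre-softmax scores (X W) C (X W)^T / sqrt(d_k).

    Assumption: one shared projection W stands in for both the query and
    the key projection, and C is a symmetric d_k x d_k coefficient matrix.
    """
    p = matmul(x, w)  # shared projection, shape (tokens, d_k)
    d_k = len(w[0])
    raw = matmul(matmul(p, coeff), transpose(p))
    return [[s / math.sqrt(d_k) for s in row] for row in raw]


def symmetric_attention(x, w, coeff, w_v):
    """Attention output using the symmetric scores and a separate value projection."""
    weights = softmax_rows(symmetric_scores(x, w, coeff))
    return matmul(weights, matmul(x, w_v))
```

Under this reading, a second model-width by head-width projection matrix is replaced by a head-width square coefficient matrix, which is one way a parameter reduction could arise. Note that only the scores are symmetric; the row-wise softmax makes the final attention weights asymmetric in general.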

Comments:	to be published in Findings of the Association for Computational Linguistics: ACL 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.06366 [cs.CL]
	(or arXiv:2406.06366v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.06366

Submission history

From: Martin Courtois
[v1] Mon, 10 Jun 2024 15:24:15 UTC (258 KB)
[v2] Wed, 19 Jun 2024 10:42:15 UTC (256 KB)
