Reducing the Transformer Architecture to a Minimum

Bermeitinger, Bernhard; Hrycej, Tomas; Pavone, Massimo; Kath, Julianus; Handschuh, Siegfried

doi:10.5220/0012891000003838

arXiv:2410.13732 (cs)

[Submitted on 17 Oct 2024 (v1), last revised 29 Oct 2024 (this version, v2)]

Abstract: Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the attention mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and from realistic scenes in CV. A classical neural network component, the Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its ability to model nonlinear relationships. However, the attention mechanism is itself nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. Because the MLPs usually contain most of the trainable parameters of the whole model, omitting them would substantially reduce the parameter count. Further components can also be reorganized to reduce the number of parameters. Under some conditions, the query and key matrices can be collapsed into a single matrix of the same size. The same is true of the value and projection matrices, which can also be omitted without eliminating the substance of the attention mechanism. In the original formulation, the similarity measure is defined asymmetrically, with peculiar properties such as a token possibly being dissimilar to itself. A possible symmetric definition requires only half of the parameters. We have laid the groundwork by testing on two widespread CV benchmarks, MNIST and CIFAR-10. The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) with symmetric similarity matrices perform similarly to the original architecture, saving up to 90% of parameters without hurting classification performance.
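
The query/key collapse mentioned in the abstract can be illustrated with a small NumPy sketch. It is not taken from the paper: the sizes, the variable names, the single-head setting with head width equal to the embedding width, the scaling by sqrt(d), and the way symmetry is imposed are all assumptions made for illustration. It only demonstrates the algebraic identity (X W_q)(X W_k)^T = X (W_q W_k^T) X^T and the resulting parameter counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 5, 8  # hypothetical sizes; one head whose width equals d

X = rng.normal(size=(n_tokens, d))  # token embeddings, one row per token
W_q = rng.normal(size=(d, d))       # query matrix
W_k = rng.normal(size=(d, d))       # key matrix


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Usual similarity scores from separate query and key projections.
scores_two = (X @ W_q) @ (X @ W_k).T

# Collapsed form: a single d x d matrix yields identical scores.
W_qk = W_q @ W_k.T
scores_one = X @ W_qk @ X.T

# Symmetric variant (assumed construction): mirror the upper triangle,
# so only d * (d + 1) / 2 entries are free parameters.
A = rng.normal(size=(d, d))
W_sym = np.triu(A) + np.triu(A, k=1).T
scores_sym = X @ W_sym @ X.T

# Attention output without value and projection matrices (assumed form):
# the attention weights are applied directly to the embeddings.
attn = softmax(scores_one / np.sqrt(d))
out = attn @ X

params_two = 2 * d * d          # separate W_q and W_k: 128
params_one = d * d              # collapsed W_qk: 64
params_sym = d * (d + 1) // 2   # symmetric similarity matrix: 36
```

In this setting the collapsed matrix reproduces the two-matrix scores exactly (up to floating-point error), and the symmetric matrix makes the similarity of token i to token j equal to that of j to i. Whether accuracy is preserved after such simplifications is an empirical question that the sketch does not address.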

Comments:	8 pages, to appear in KDIR2024
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2410.13732 [cs.LG]
	(or arXiv:2410.13732v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.13732
Journal reference:	Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR2024
Related DOI:	https://doi.org/10.5220/0012891000003838

Submission history

From: Bernhard Bermeitinger
[v1] Thu, 17 Oct 2024 16:36:14 UTC (155 KB)
[v2] Tue, 29 Oct 2024 14:13:27 UTC (155 KB)
