Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Musat, Tiberiu

arXiv:2411.12118 (cs)

[Submitted on 18 Nov 2024]

Authors: Tiberiu Musat

Abstract: In this paper, I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with at least a certain minimum number of layers. The task has an adjustable difficulty, which can raise the required number of layers to any arbitrary value. I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. I find that successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps of the trained transformers. I also study the training process and find that attention heads always emerge in a specific sequence.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2411.12118 [cs.LG] (or arXiv:2411.12118v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2411.12118

Submission history

From: Tiberiu Musat
[v1] Mon, 18 Nov 2024 23:12:13 UTC (1,204 KB)
