Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

He, Tianyu; Doshi, Darshil; Das, Aritra; Gromov, Andrey

arXiv:2406.02550 (cs)

[Submitted on 4 Jun 2024 (v1), last revised 4 Nov 2024 (this version, v2)]

Abstract: Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a \, x + b \, y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is transient, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, which reveals highly structured representations in both attention heads and MLPs, and we discuss the learned algorithms. Notably, we find an algorithmic shift in deeper models as we go from few to many in-context examples.
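The task family described in the abstract can be illustrated with a minimal Python sketch. It enumerates task vectors $(a, b) \in \mathbb{Z}_p^2$, splits them into pre-training and out-of-distribution sets, and samples in-context examples for one task. The function names, the use of all $p^2$ task vectors, the uniform random split, and the $(x, y, z)$ triple layout are assumptions made for illustration; the abstract does not specify the paper's actual data pipeline.

```python
import itertools
import random


def make_tasks(p):
    """Enumerate every task vector (a, b) in Z_p^2 (assumed: all p^2 are used)."""
    return list(itertools.product(range(p), repeat=2))


def split_tasks(tasks, n_pretrain, seed=0):
    """Randomly split tasks into pre-training and held-out (OOD) sets.

    A uniform random split is an assumption of this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return shuffled[:n_pretrain], shuffled[n_pretrain:]


def evaluate(task, x, y, p):
    """Compute z = a*x + b*y mod p for the task labeled by (a, b)."""
    a, b = task
    return (a * x + b * y) % p


def sample_sequence(task, p, n_examples, rng):
    """Draw n_examples in-context triples (x, y, z) that all share one task."""
    sequence = []
    for _ in range(n_examples):
        x, y = rng.randrange(p), rng.randrange(p)
        sequence.append((x, y, evaluate(task, x, y, p)))
    return sequence
```

For example, with a hypothetical `p = 5`, `split_tasks(make_tasks(5), n_pretrain=20)` yields 20 pre-training tasks and 5 held-out tasks, and `sample_sequence(task, 5, 8, random.Random(0))` returns eight triples generated by the same hidden `(a, b)`, which a model would have to infer from context.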

Comments:	Camera-ready version, NeurIPS 2024 (Oral)
Subjects:	Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
Cite as:	arXiv:2406.02550 [cs.LG]
	(or arXiv:2406.02550v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2406.02550

Submission history

From: Tianyu He
[v1] Tue, 4 Jun 2024 17:59:36 UTC (19,500 KB)
[v2] Mon, 4 Nov 2024 16:04:27 UTC (38,683 KB)