A transfer learning framework for weak-to-strong generalization

Somerstep, Seamus; Polo, Felipe Maia; Banerjee, Moulinath; Ritov, Ya'acov; Yurochkin, Mikhail; Sun, Yuekai

arXiv:2405.16236 (stat)

[Submitted on 25 May 2024 (v1), last revised 14 Mar 2025 (this version, v3)]

Abstract: Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether these techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unknown if it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities. This is an instance of the weak-to-strong generalization problem: using feedback from a weaker (less capable) model to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. In particular, we cast the weak-to-strong generalization problem as a transfer learning problem in which we wish to transfer a latent concept prior from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but an alternative refinement-based approach suggested by the problem structure provably overcomes the limitations of fine-tuning. Finally, we demonstrate the practical applicability of the refinement approach in multiple LLM alignment tasks.

Comments:	v2: Major changes to setup, theory, and experiments; v3: Camera ready
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:2405.16236 [stat.ML]
	(or arXiv:2405.16236v3 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2405.16236

Submission history

From: Seamus Somerstep
[v1] Sat, 25 May 2024 13:54:05 UTC (1,645 KB)
[v2] Thu, 28 Nov 2024 14:58:34 UTC (1,520 KB)
[v3] Fri, 14 Mar 2025 17:08:22 UTC (1,520 KB)
