MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Farquhar, Sebastian; Varma, Vikrant; Lindner, David; Elson, David; Biddulph, Caleb; Goodfellow, Ian; Shah, Rohin

arXiv:2501.13011 (cs)

[Submitted on 22 Jan 2025]

Abstract: Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method that prevents agents from learning undesired multi-step plans that receive high reward (multi-step "reward hacks"), even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not have access to. We study MONA empirically in three settings that model different misalignment failure modes: 2-step environments with LLMs, representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.
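The contrast between ordinary RL and "short-sighted optimization with far-sighted reward" can be illustrated with a toy calculation of per-step training targets. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `discounted_returns` and `myopic_targets`, the additive combination of step reward and approval, and the example numbers are all hypothetical. The assumption is that ordinary RL credits each step with the discounted sum of all later rewards, while a MONA-style target credits each step only with its immediate reward plus an overseer's approval of that step.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Ordinary RL target: each step is credited with all future reward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the discounted reward that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def myopic_targets(step_rewards: List[float], approvals: List[float]) -> List[float]:
    """Hypothetical MONA-style target: immediate reward plus overseer approval.

    No reward from later steps flows back to earlier steps. The approval term
    (an assumption here) stands in for the overseer's judgement of how useful
    the step looks for the future.
    """
    if len(step_rewards) != len(approvals):
        raise ValueError("step_rewards and approvals must have the same length")
    return [r + a for r, a in zip(step_rewards, approvals)]


# Toy 2-step episode (made-up numbers): step 0 sets up an exploit that the
# environment reward does not penalize, step 1 cashes it in for high reward.
env_rewards = [0.0, 1.0]
approvals = [-0.5, 0.0]  # the overseer does not endorse the odd-looking setup step

ordinary = discounted_returns(env_rewards)       # setup step inherits the later reward
myopic = myopic_targets(env_rewards, approvals)  # setup step is judged on its own
```

In this toy episode the ordinary target for step 0 includes the reward earned at step 1, so the setup step is reinforced. Under the myopic target, step 0 receives only its own reward and approval, so a setup step that pays off only through the later exploit gains nothing from it.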

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2501.13011 [cs.LG] (or arXiv:2501.13011v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2501.13011

Submission history

From: David Lindner
[v1] Wed, 22 Jan 2025 16:53:08 UTC (1,101 KB)
