Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures

Bolland, Adrien; Lambrechts, Gaspard; Ernst, Damien

Computer Science > Machine Learning

arXiv:2412.06655 (cs)

[Submitted on 9 Dec 2024]

Abstract: We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process to be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features of these states and actions) visited during the next time steps. We first prove that, under some assumptions, an optimal exploration policy, which maximizes the expected discounted sum of intrinsic rewards, also maximizes a lower bound on the state-action value function of the decision process. We also prove that the visitation distribution used in the intrinsic reward definition is the fixed point of a contraction operator. We then describe how to adapt existing algorithms to learn this fixed point and compute the intrinsic rewards to enhance exploration. Finally, we introduce a new practical off-policy maximum entropy reinforcement learning algorithm. Empirically, exploration policies have good state-action space coverage, and high-performing control policies are computed efficiently.
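
To make the fixed-point statement concrete, here is a minimal tabular sketch. It is not the authors' algorithm, only an illustration under stated assumptions: a finite MDP with known transition probabilities `P[s][a][s']`, a fixed policy `pi[s][a]`, and the standard discounted future state-action visitation, which satisfies d = (1 - gamma) M + gamma M d, where M is the state-action transition matrix induced by the policy. All function names are hypothetical. The relative entropy is taken against a uniform reference purely for illustration; the abstract does not specify the reference measure or the sign convention of the intrinsic reward.

```python
import math


def state_action_transition(P, pi):
    """Build M with M[(s, a)][(s2, a2)] = P[s][a][s2] * pi[s2][a2]."""
    n_states, n_actions = len(pi), len(pi[0])
    return [
        [P[s][a][s2] * pi[s2][a2]
         for s2 in range(n_states) for a2 in range(n_actions)]
        for s in range(n_states) for a in range(n_actions)
    ]


def apply_operator(M, d, gamma):
    """Apply T(d) = (1 - gamma) * M + gamma * M @ d once."""
    n = len(M)
    return [
        [(1 - gamma) * M[i][j]
         + gamma * sum(M[i][k] * d[k][j] for k in range(n))
         for j in range(n)]
        for i in range(n)
    ]


def visitation_fixed_point(M, gamma, tol=1e-12, max_iters=10_000):
    """Iterate T from a uniform guess until successive iterates agree."""
    n = len(M)
    d = [[1.0 / n] * n for _ in range(n)]
    for _ in range(max_iters):
        new_d = apply_operator(M, d, gamma)
        gap = max(abs(new_d[i][j] - d[i][j])
                  for i in range(n) for j in range(n))
        d = new_d
        if gap < tol:
            break
    return d


def relative_entropy(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


# Toy MDP with 2 states and 2 actions (assumed values, for illustration only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.3, 0.7]]
gamma = 0.9

M = state_action_transition(P, pi)
d = visitation_fixed_point(M, gamma)
uniform = [1.0 / len(M)] * len(M)
for row in d:
    # One relative-entropy value per starting (state, action) pair.
    print(relative_entropy(row, uniform))
```

Because M is row-stochastic, T maps row-stochastic matrices to row-stochastic matrices and shrinks the largest entrywise difference between two inputs by at least a factor of gamma, which is why plain iteration converges here. In large or continuous spaces this exact iteration is not available, which is where learned approximations of the fixed point come in.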

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:2412.06655 [cs.LG] (or arXiv:2412.06655v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2412.06655

Submission history

From: Adrien Bolland
[v1] Mon, 9 Dec 2024 16:56:06 UTC (2,191 KB)
