CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

Xu, Tengyu; Liang, Yingbin; Lan, Guanghui

arXiv:2011.05869 (cs)

[Submitted on 11 Nov 2020 (v1), last revised 31 May 2021 (this version, v3)]

Abstract: In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violation of constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly when a globally optimal policy is required. Many popular SRL algorithms adopt a primal-dual structure, which updates dual variables to satisfy the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which alternates policy updates between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the globally optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of primal SRL algorithms with a global optimality guarantee. Our empirical results demonstrate that CRPO can significantly outperform the existing primal-dual baseline algorithms.
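The alternation described in the abstract can be illustrated with a minimal sketch on a toy problem. Everything below is an assumption made for illustration and is not taken from the paper: the one-state (bandit) setting, the payoff numbers, the names `crpo_bandit`, `tolerance`, and `lr`, and the use of exactly computed expected costs in place of estimates. The logit update is a simple stand-in for a policy optimization step, not the NPG update analyzed by the authors.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected(policy, values):
    # Expected payoff of a stochastic policy over discrete actions.
    return sum(p * v for p, v in zip(policy, values))

def crpo_bandit(reward, costs, limits, tolerance=0.05, lr=0.1, steps=2000):
    """Alternate between reward improvement and cost reduction.

    reward: per-action reward; costs: list of per-action cost vectors;
    limits: one threshold per cost. All names here are hypothetical.
    """
    logits = [0.0] * len(reward)
    objective_steps = 0
    for _ in range(steps):
        policy = softmax(logits)
        # Constraints whose expected cost exceeds limit + tolerance.
        violated = [i for i, (c, d) in enumerate(zip(costs, limits))
                    if expected(policy, c) > d + tolerance]
        if not violated:
            # All constraints hold: take a step that improves the reward.
            direction = reward
            objective_steps += 1
        else:
            # Otherwise: take a step that reduces one violated cost.
            direction = [-c for c in costs[violated[0]]]
        logits = [x + lr * g for x, g in zip(logits, direction)]
    return softmax(logits), objective_steps

# Toy problem (made-up numbers): three actions, one cost constraint.
REWARD = [1.0, 0.5, 0.0]
COST = [1.0, 0.2, 0.0]
policy, n_obj = crpo_bandit(REWARD, [COST], [0.4])
print("policy:", [round(p, 3) for p in policy])
print("expected reward:", round(expected(policy, REWARD), 3))
print("expected cost:", round(expected(policy, COST), 3))
```

No dual variable appears anywhere in the loop: the only state is the policy itself, and the constraint check decides which payoff drives the next update. In this toy run the iterates hover near the constraint boundary, which is where the tolerance parameter matters.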

Comments:	Published in ICML 2021
Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2011.05869 [cs.LG]
	(or arXiv:2011.05869v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2011.05869

Submission history

From: Tengyu Xu
[v1] Wed, 11 Nov 2020 16:05:14 UTC (63 KB)
[v2] Tue, 17 Nov 2020 21:24:18 UTC (64 KB)
[v3] Mon, 31 May 2021 04:41:09 UTC (111 KB)
