Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Song, Jun; He, Niao; Ding, Lijun; Zhao, Chaoyue

arXiv:2306.14133 (cs)

[Submitted on 25 Jun 2023]

Abstract: Trust-region methods based on Kullback-Leibler divergence are widely used to stabilize policy optimization in reinforcement learning. In this paper, we exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of restricting the policy to a parametric distribution class, we directly optimize the policy distribution and derive closed-form policy updates for both methods based on Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic performance improvement, and that SPO provably converges to WPO as the entropic regularizer diminishes. Moreover, we prove that with a decaying Lagrangian multiplier on the trust region constraint, both methods converge to global optimality. Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate that both approaches improve performance over state-of-the-art policy gradient methods, with WPO being more robust to sample insufficiency and SPO converging faster.
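
The abstract's statement that the Sinkhorn variant approaches the Wasserstein variant as the entropic regularizer diminishes can be illustrated with a small numerical sketch. The code below is not the paper's WPO or SPO update; it only compares, for two hypothetical action distributions on a three-point line, the exact Wasserstein-1 distance with the transport cost of an entropy-regularized plan computed by log-domain Sinkhorn iterations. The distributions, the cost matrix, the parameter name `eps`, and all function names are assumptions made for illustration, and the paper's own definition of the Sinkhorn divergence may differ in detail.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def wasserstein_1d(p, q):
    """Exact W1 between two distributions on the points 0, 1, ..., n-1."""
    total, cdf_gap = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i   # running difference of the two CDFs
        total += abs(cdf_gap)  # neighbouring points are one unit apart
    return total


def entropic_transport_cost(p, q, cost, eps, iters=1000):
    """Transport cost <plan, cost> of the entropy-regularized plan.

    Log-domain Sinkhorn iterations; assumes every entry of p and q is positive.
    """
    n, m = len(p), len(q)
    log_p = [math.log(x) for x in p]
    log_q = [math.log(x) for x in q]
    f = [0.0] * n  # dual potential for the rows
    g = [0.0] * m  # dual potential for the columns
    for _ in range(iters):
        # Each update enforces one marginal of the plan exactly.
        f = [eps * (log_p[i] - logsumexp([(g[j] - cost[i][j]) / eps
                                          for j in range(m)]))
             for i in range(n)]
        g = [eps * (log_q[j] - logsumexp([(f[i] - cost[i][j]) / eps
                                          for i in range(n)]))
             for j in range(m)]
    return sum(math.exp((f[i] + g[j] - cost[i][j]) / eps) * cost[i][j]
               for i in range(n) for j in range(m))


# Hypothetical old and new policies over three actions placed at 0, 1, 2.
old_policy = [0.5, 0.3, 0.2]
new_policy = [0.2, 0.3, 0.5]
cost = [[abs(i - j) for j in range(3)] for i in range(3)]

exact = wasserstein_1d(old_policy, new_policy)
eps_values = [5.0, 1.0, 0.2]
costs = [entropic_transport_cost(old_policy, new_policy, cost, eps)
         for eps in eps_values]

for eps, value in zip(eps_values, costs):
    print(f"eps={eps}: entropic cost={value:.4f}, exact W1={exact:.4f}")
```

With a large `eps` the regularized plan is close to the independent coupling of the two policies and therefore moves more mass than necessary; as `eps` shrinks, its cost approaches the exact Wasserstein-1 value.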

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Cite as:	arXiv:2306.14133 [cs.LG]
	(or arXiv:2306.14133v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.14133
Journal reference:	Transactions on Machine Learning Research, 2023

Submission history

From: Jun Song
[v1] Sun, 25 Jun 2023 05:41:38 UTC (2,245 KB)
