Near-Optimal Last-iterate Convergence of Policy Optimization in Zero-sum Polymatrix Markov games

Ma, Zailin; Yang, Jiansheng; Zhang, Zhihua

arXiv:2308.07873 (cs)

This paper has been withdrawn by Zailin Ma

[Submitted on 15 Aug 2023 (v1), last revised 16 Aug 2023 (this version, v2)]

Abstract: Computing approximate Nash equilibria in multi-player general-sum Markov games is computationally intractable. However, multi-player Markov games with certain cooperative or competitive structures might circumvent this intractability. In this paper, we focus on multi-player zero-sum polymatrix Markov games, where players interact in a pairwise fashion while remaining competitive overall. We propose what is, to the best of our knowledge, the first policy optimization algorithm for finding approximate Nash equilibria in finite-horizon zero-sum polymatrix Markov games with full-information feedback, called Entropy-Regularized Optimistic-Multiplicative-Weights-Update (ER-OMWU). We provide last-iterate convergence guarantees for finding an $\epsilon$-approximate Nash equilibrium within $\tilde{O}(1/\epsilon)$ iterations. This is near-optimal compared to the optimal $O(1/\epsilon)$ iteration complexity in two-player zero-sum Markov games, which are a degenerate case of zero-sum polymatrix games with only two players involved. Our algorithm combines regularized and optimistic learning dynamics with a separate smooth value update within a single loop, where players update strategies in a symmetric and almost uncoupled manner. It provides natural dynamics for finding equilibria and is more likely to be adaptable in the future to a sample-efficient and fully decentralized implementation where only partial-information feedback is available.
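
The abstract names the two ingredients of the learning dynamics, entropy regularization and optimistic multiplicative-weights updates. The sketch below illustrates that combination in the simplest possible setting: a single two-player zero-sum matrix game, not the multi-player, finite-horizon Markov setting of the paper. It is not the authors' ER-OMWU (the manuscript is withdrawn and its update rules are not reproduced here), and it omits the value update entirely. The function names, the step size `eta`, the regularization weight `tau`, and the rock-paper-scissors payoff matrix are all assumptions chosen for illustration.

```python
import math

def normalize_logits(logits):
    """Map log-weights to a probability vector (numerically stable softmax)."""
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def regularized_mwu_step(policy, payoff, eta, tau):
    """One entropy-regularized multiplicative-weights step.

    new(a) is proportional to policy(a)^(1 - eta*tau) * exp(eta * payoff(a)).
    """
    return normalize_logits(
        [(1.0 - eta * tau) * math.log(p) + eta * g for p, g in zip(policy, payoff)]
    )

def er_omwu_matrix_game(A, eta=0.1, tau=0.1, iters=2000):
    """Illustrative entropy-regularized optimistic MWU on a matrix game.

    The row player maximizes x^T A y, the column player minimizes it.
    All hyperparameters are assumed values, not taken from the paper.
    """
    n, m = len(A), len(A[0])
    # Deliberately non-uniform starting policies.
    x = normalize_logits([0.5 * i for i in range(n)])
    y = normalize_logits([-0.5 * j for j in range(m)])
    x_bar, y_bar = x[:], y[:]
    for _ in range(iters):
        # Prediction step: respond to the opponent's previous predicted policy.
        gx = [sum(A[i][j] * y_bar[j] for j in range(m)) for i in range(n)]
        gy = [-sum(A[i][j] * x_bar[i] for i in range(n)) for j in range(m)]
        x_bar_next = regularized_mwu_step(x, gx, eta, tau)
        y_bar_next = regularized_mwu_step(y, gy, eta, tau)
        # Update step: respond to the opponent's fresh predicted policy.
        gx = [sum(A[i][j] * y_bar_next[j] for j in range(m)) for i in range(n)]
        gy = [-sum(A[i][j] * x_bar_next[i] for i in range(n)) for j in range(m)]
        x = regularized_mwu_step(x, gx, eta, tau)
        y = regularized_mwu_step(y, gy, eta, tau)
        x_bar, y_bar = x_bar_next, y_bar_next
    return x, y

# Rock-paper-scissors payoffs for the row player (an assumed toy example).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

if __name__ == "__main__":
    x, y = er_omwu_matrix_game(RPS)
    print("row policy:", x)
    print("column policy:", y)
```

By symmetry, the entropy-regularized solution of rock-paper-scissors is uniform play, so the last iterates themselves (not their averages) can be checked against the uniform distribution. Extending this to the polymatrix Markov setting would require per-step value estimates and pairwise payoffs, which this sketch does not attempt.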

Comments:	The proof of Lemma 3.4 is wrong: $\bar{L}_{h+1}^{t-k}$ should be replaced by $\sqrt{\bar{L}_{h+1}^{t-k}}$. In this case, the proof of the main theorem should be substantially modified.
Subjects:	Computer Science and Game Theory (cs.GT)
Cite as:	arXiv:2308.07873 [cs.GT]
	(or arXiv:2308.07873v2 [cs.GT] for this version)
	https://doi.org/10.48550/arXiv.2308.07873

Submission history

From: Zailin Ma
[v1] Tue, 15 Aug 2023 16:40:10 UTC (34 KB)
[v2] Wed, 16 Aug 2023 17:17:48 UTC (1 KB) (withdrawn)
