Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning

Kallus, Nathan; Uehara, Masatoshi

arXiv:1909.05850 (stat)

[Submitted on 12 Sep 2019 (v1), last revised 15 Jan 2023 (this version, v6)]

Abstract: Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE under each of these structural assumptions. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. In time-invariant Markov decision processes, however, our bounds show that truly off-policy evaluation is feasible, even with only one dependent trajectory, and they provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions; it remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
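
To illustrate how an estimator can combine a stationary density ratio with a $q$-function, here is a minimal sketch. It is not taken from the paper. It assumes a discounted infinite-horizon formulation in which the policy value is normalized by (1 - gamma), and it assumes the nuisance estimates have already been evaluated on the data. The function name `drl_estimate` and all argument names are hypothetical, and the cross-fitting that is typically used with such estimators is omitted for brevity.

```python
import numpy as np


def drl_estimate(w, r, q_sa, v_next, v_init, gamma):
    """Doubly robust estimate of a normalized discounted policy value.

    All names are hypothetical; every input is assumed to be precomputed.
      w      : estimated density ratio at each observed (s, a)
      r      : observed rewards
      q_sa   : estimated q-function at each observed (s, a)
      v_next : estimated q at the next state, averaged over the target policy
      v_init : same averaged q, evaluated at sampled initial states
      gamma  : discount factor in [0, 1)
    """
    w = np.asarray(w, dtype=float)
    r = np.asarray(r, dtype=float)
    q_sa = np.asarray(q_sa, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    v_init = np.asarray(v_init, dtype=float)

    # Plug-in ("direct") term built from the q-function alone.
    direct_term = (1.0 - gamma) * np.mean(v_init)
    # Correction: density-ratio-weighted temporal-difference residuals.
    correction = np.mean(w * (r + gamma * v_next - q_sa))
    return direct_term + correction


# Toy check: one state, one action, reward 1, so the normalized value is 1.
gamma = 0.9
n = 4
true_q = np.full(n, 1.0 / (1.0 - gamma))
rewards = np.ones(n)

# Correct q with a wrong density ratio: the residuals vanish.
est_bad_w = drl_estimate(np.full(n, 5.0), rewards, true_q, true_q, true_q, gamma)
# Correct density ratio with a wrong q (all zeros): the correction recovers the value.
zeros = np.zeros(n)
est_bad_q = drl_estimate(np.ones(n), rewards, zeros, zeros, zeros, gamma)
```

In this toy setting both calls return a value close to 1, which mirrors the abstract's claim that consistency holds when either nuisance is estimated consistently. The sketch says nothing about the efficiency result, which concerns estimation rates.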

Comments:	In v3, we significantly changed the derivation of the efficiency bound to follow standard (iid) semiparametric theory and derived the efficient influence function. In v4, we added an experiment in a continuous-state environment employing function approximation. In v6, we fixed several typos. Please refer to this version as the final version.
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Cite as:	arXiv:1909.05850 [stat.ML]
	(or arXiv:1909.05850v6 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1909.05850

Submission history

From: Masatoshi Uehara
[v1] Thu, 12 Sep 2019 17:52:55 UTC (273 KB)
[v2] Mon, 23 Dec 2019 10:10:10 UTC (279 KB)
[v3] Thu, 11 Mar 2021 17:57:41 UTC (308 KB)
[v4] Wed, 24 Mar 2021 04:35:06 UTC (308 KB)
[v5] Mon, 3 May 2021 15:27:23 UTC (340 KB)
[v6] Sun, 15 Jan 2023 15:05:42 UTC (602 KB)
