Authors:
(1) Jongmin Lee, Department of Mathematical Sciences, Seoul National University;
(2) Ernest K. Ryu, Department of Mathematical Sciences, Seoul National University and Interdisciplinary Program in Artificial Intelligence, Seoul National University.
Abstract and 1 Introduction
1.1 Notations and preliminaries
1.2 Prior works
2 Anchored Value Iteration
2.1 Accelerated rate for Bellman consistency operator
2.2 Accelerated rate for Bellman optimality operator
3 Convergence when γ = 1
4 Complexity lower bound
5 Approximate Anchored Value Iteration
6 Gauss–Seidel Anchored Value Iteration
7 Conclusion, Acknowledgments and Disclosure of Funding and References
A Preliminaries
B Omitted proofs in Section 2
C Omitted proofs in Section 3
D Omitted proofs in Section 4
E Omitted proofs in Section 5
F Omitted proofs in Section 6
G Broader Impacts
H Limitations
3 Convergence when γ = 1
Undiscounted MDPs are not commonly studied in the DP and RL theory literature due to the following difficulties: the Bellman consistency and optimality operators may not have fixed points; VI is a nonexpansive (not contractive) fixed-point iteration and may not converge to a fixed point even if one exists; and the interpretation of a fixed point as the (optimal) value function becomes unclear when the fixed point is not unique. However, many modern deep RL setups do not actually use discounting [2], and this empirical practice makes the theoretical analysis of the γ = 1 case relevant.
In this section, we show that Anc-VI converges to fixed points of the Bellman consistency and optimality operators of undiscounted MDPs. While a full treatment of undiscounted MDPs is beyond the scope of this paper, we show that fixed points, if they exist, can be found, and we therefore argue that the inability to find fixed points should not be considered an obstacle to studying the γ = 1 setup.
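As a concrete illustration, the following Python sketch (our own, not code from the paper) runs the anchored iteration V^k = β_k V^0 + (1 − β_k) T V^{k−1} with anchor weights β_k = 1/(k+1), the γ = 1 instance of the paper's coefficients, on a hypothetical three-state absorbing MDP whose Bellman optimality operator admits a fixed point despite being only nonexpansive. The MDP, variable names, and iteration count are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions; state 2 is absorbing with zero
# reward, so the Bellman optimality operator has a fixed point even though
# gamma = 1 makes it merely nonexpansive rather than contractive.
P = np.zeros((2, 3, 3))                     # P[a, s, s'] = transition probability
P[0] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]    # action 0: step toward the absorbing state
P[1] = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    # action 1: stay put
R = np.array([[-1.0, -1.0, 0.0],            # R[a, s]: each step costs 1,
              [-2.0, -2.0, 0.0]])           # staying costs 2 (0 once absorbed)

def bellman_opt(V):
    """Bellman optimality operator T with gamma = 1."""
    return np.max(R + P @ V, axis=0)

V0 = np.zeros(3)                            # anchor point V^0
V = V0.copy()
for k in range(1, 500):
    beta = 1.0 / (k + 1)                    # anchor weight beta_k -> 0
    V = beta * V0 + (1 - beta) * bellman_opt(V)

print(V)                                    # tends to the fixed point [-2, -1, 0]
```

In this sketch, the iterates approach the fixed point and the fixed-point residual shrinks roughly like 1/k, consistent with the nonexpansive regime, in contrast to the geometric rate available when γ < 1.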
We first state our convergence result for finite state-action spaces.
[3] Well-definedness of T requires a σ-algebra on the state and action spaces, well-defined expectations with respect to the transition probabilities and the policy, boundedness and measurability of the output of the Bellman operator, etc.