Authors:
(1) Jongmin Lee, Department of Mathematical Sciences, Seoul National University;
(2) Ernest K. Ryu, Department of Mathematical Sciences, Seoul National University and Interdisciplinary Program in Artificial Intelligence, Seoul National University.
Abstract and 1 Introduction
1.1 Notations and preliminaries
1.2 Prior works
2 Anchored Value Iteration
2.1 Accelerated rate for Bellman consistency operator
2.2 Accelerated rate for Bellman optimality operator
3 Convergence when γ = 1
4 Complexity lower bound
5 Approximate Anchored Value Iteration
6 Gauss–Seidel Anchored Value Iteration
7 Conclusion, Acknowledgments and Disclosure of Funding and References
A Preliminaries
B Omitted proofs in Section 2
C Omitted proofs in Section 3
D Omitted proofs in Section 4
E Omitted proofs in Section 5
F Omitted proofs in Section 6
G Broader Impacts
H Limitations
Abstract
Value Iteration (VI) is foundational to the theory and practice of modern reinforcement learning, and it is known to converge at an O(γ^k)-rate, where γ is the discount factor. Surprisingly, however, the optimal rate in terms of Bellman error for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an anchoring mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits an O(1/k)-rate for γ ≈ 1 or even γ = 1, while standard VI has rate O(1) for γ ≥ 1 − 1/k, where k is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of 4, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss–Seidel VI setups as well.
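Schematically, whereas standard VI updates V^{k+1} = T V^k, an anchored (Halpern-type) iteration updates V^{k+1} = β_{k+1} V^0 + (1 − β_{k+1}) T V^k, pulling each iterate back toward the starting point V^0 with anchor weights β_k that shrink as k grows; the specific choice of β_k is what the analysis tunes to obtain the accelerated rate.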
1 Introduction
Value Iteration (VI) is foundational to the theory and practice of modern dynamic programming (DP) and reinforcement learning (RL). It is well known that when a discount factor γ < 1 is used, (exact) VI is a contractive iteration in the ‖·‖∞-norm and therefore converges. The progress of VI is measured by the Bellman error in practice (as the distance to the fixed point is not computable), and much prior work has been dedicated to analyzing the rates of convergence of VI and its variants.
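To make the setup concrete, the following Python sketch (an illustration, not code from the paper) runs standard VI and a simplified anchored variant on a small random tabular MDP and reports the Bellman error ‖TV − V‖∞. The toy MDP and the Halpern-style anchor weights β_k = 1/(k+1) are assumptions made here for illustration, not the γ-dependent weights used in Anc-VI's analysis.

    # Illustrative sketch: standard VI vs. a Halpern-style anchored iteration
    # on a small random tabular MDP (toy data, not from the paper).
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 20, 4, 0.99

    # Toy transition kernel P[a, s, s'] and rewards R[s, a].
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions))

    def bellman_optimality(V):
        # (T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V(s') ]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        return Q.max(axis=1)

    def value_iteration(V0, num_iters):
        # Standard VI: V^{k+1} = T V^k.
        V = V0
        for _ in range(num_iters):
            V = bellman_optimality(V)
        return V

    def anchored_value_iteration(V0, num_iters):
        # Anchored iteration: V^{k+1} = beta * V^0 + (1 - beta) * T V^k,
        # with the illustrative Halpern-style weight beta = 1/(k + 2);
        # Anc-VI's analysis uses its own gamma-dependent weights.
        V = V0
        for k in range(num_iters):
            beta = 1.0 / (k + 2)
            V = beta * V0 + (1.0 - beta) * bellman_optimality(V)
        return V

    V0 = np.zeros(n_states)
    for name, method in [("VI", value_iteration), ("anchored VI (sketch)", anchored_value_iteration)]:
        V = method(V0, 200)
        bellman_err = np.max(np.abs(bellman_optimality(V) - V))  # ||T V - V||_inf
        print(f"{name}: Bellman error after 200 iterations = {bellman_err:.2e}")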