On Q-learning convergence for non-Markov decision processes
Authors
Majeed, Sultan
Hutter, Marcus
Publisher
AAAI Press
Abstract
Temporal-difference (TD) learning is an attractive, computationally efficient framework for model-free reinforcement learning. Q-learning is one of the most widely used TD learning techniques, enabling an agent to learn the optimal action-value function, i.e. the Q-value function. Despite its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains that may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge, even in the case of infinitely many internal states.
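For context, the update the abstract refers to is standard tabular Q-learning. Below is a minimal sketch in Python, assuming a textbook formulation; the environment interface (reset(), step(), actions) and the hyperparameters are illustrative, not taken from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Assumes `env` exposes reset() -> state, step(action) -> (state, reward,
    done), and a list `env.actions`; these names are hypothetical, chosen
    only to make the sketch self-contained.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)

            # The Q-learning TD update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

In the non-Markovian setting the paper studies, `s` would be the agent's observation (or internal state) rather than the full environment state; the abstract's result characterizes when this same update still converges, namely when the optimal Q-value function is state-uniform.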
Source
IJCAI International Joint Conference on Artificial Intelligence
Access Statement
Free Access via publisher website
Restricted until
2099-12-31