On Q-learning convergence for non-Markov decision processes
Authors
Majeed, Sultan
Hutter, Marcus
Publisher
AAAI Press
Abstract
Temporal-difference (TD) learning is an attractive, computationally efficient framework for model-free reinforcement learning. Q-learning is one of the most widely used TD learning techniques, enabling an agent to learn the optimal action-value function, i.e. the Q-value function. Despite its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains that may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge, even in the case of infinitely many internal states.
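For context, the update the abstract refers to is standard tabular Q-learning. Below is a minimal sketch in Python, assuming a textbook formulation; the environment interface (reset(), step(), actions) and the hyperparameters are illustrative, not taken from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Assumes `env` exposes reset() -> state, step(action) -> (state, reward,
    done), and a list `env.actions`; these names are hypothetical, chosen
    only to make the sketch self-contained.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)

            # The Q-learning TD update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

In the non-Markovian setting the paper studies, `s` would be the agent's observation (or internal state) rather than the full environment state; the abstract's result characterizes when this same update still converges, namely when the optimal Q-value function is state-uniform.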
Source
IJCAI International Joint Conference on Artificial Intelligence
Access Statement
Free Access via publisher website
Restricted until
2099-12-31