Q-learning for history-based reinforcement learning
Date
2013-11
Authors
Daswani, Mayank
Sunehag, Peter
Hutter, Marcus
Publisher
Journal of Machine Learning Research
Abstract
We extend the Q-learning algorithm from the Markov Decision Process
setting to problems where observations are non-Markov and do not
reveal the full state of the world, i.e., to POMDPs. We do this in a
natural manner by adding l0 regularisation to the pathwise squared
Q-learning objective function and then optimising this over both a
choice of map from history to states and the resulting MDP
parameters. The optimisation procedure involves a stochastic search
over the map class nested with classical Q-learning of the
parameters. This algorithm fits perfectly into the feature
reinforcement learning framework, which chooses maps based on a
cost criterion. The cost criterion used so far for feature
reinforcement learning has been model-based and aimed at predicting
future states and rewards. Instead we directly predict the return,
which is what is needed for choosing optimal actions. Our
Q-learning criterion also lends itself immediately to a function
approximation setting where features are chosen based on the
history. This algorithm is somewhat similar to the recent line of
work on lasso temporal difference learning which aims at finding a
small feature set with which one can perform policy evaluation. The
distinction is that we aim directly at learning the Q-function of
the optimal policy and we use l0 instead of l1 regularisation. We
perform an experimental evaluation on classical benchmark domains
and find improvements in both convergence speed and the economy of
the state representation. We also compare against MC-AIXI on the
large Pocman domain and achieve competitive average reward
while using less than half the CPU time and a thirty-sixth of the
memory. Overall, our algorithm hQL provides a better combination of
computational, memory and data efficiency than existing algorithms in
this setting.
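
To make the nested procedure described in the abstract concrete, the following minimal Python sketch pairs a stochastic search over history-to-state maps with classical tabular Q-learning, scoring each candidate map by the pathwise squared TD error plus an l0-style penalty on the number of induced states. It is illustrative only, not the authors' implementation: the map class (fixed-length history suffixes), the greedy acceptance rule, and the constants GAMMA, ALPHA, lr, and sweeps are all assumptions; the paper's actual map class, search procedure, and objective details differ.

```python
import random
from collections import defaultdict

GAMMA = 0.99   # discount factor (assumed)
ALPHA = 1.0    # weight of the l0-style penalty (assumed)

def q_learning(trajectory, phi, n_actions, lr=0.1, sweeps=20):
    """Classical tabular Q-learning on the MDP induced by the map phi.
    trajectory: list of (history, action, reward, next_history) tuples."""
    Q = defaultdict(float)
    for _ in range(sweeps):
        for h, a, r, h2 in trajectory:
            s, s2 = phi(h), phi(h2)
            target = r + GAMMA * max(Q[(s2, b)] for b in range(n_actions))
            Q[(s, a)] += lr * (target - Q[(s, a)])
    return Q

def cost(trajectory, phi, Q, n_actions):
    """Pathwise squared TD error of Q under phi, plus an l0-style
    penalty on the number of states the map actually uses."""
    sq_err, states = 0.0, set()
    for h, a, r, h2 in trajectory:
        s, s2 = phi(h), phi(h2)
        states.update((s, s2))
        target = r + GAMMA * max(Q[(s2, b)] for b in range(n_actions))
        sq_err += (Q[(s, a)] - target) ** 2
    return sq_err + ALPHA * len(states)

def hql_search(trajectory, n_actions, max_depth=4, iters=50):
    """Stochastic search over a map class (here: fixed-length history
    suffixes), nested with Q-learning of the induced MDP parameters."""
    def make_phi(k):
        # Map a history to the tuple of its last k elements; k = 0
        # collapses every history to a single state.
        return lambda h: tuple(h[len(h) - k:]) if k > 0 else ()
    best_k = 1
    best_Q = q_learning(trajectory, make_phi(best_k), n_actions)
    best_cost = cost(trajectory, make_phi(best_k), best_Q, n_actions)
    for _ in range(iters):
        k = random.randint(0, max_depth)   # propose a candidate map
        Q = q_learning(trajectory, make_phi(k), n_actions)
        c = cost(trajectory, make_phi(k), Q, n_actions)
        if c < best_cost:                  # greedy acceptance (assumed)
            best_k, best_Q, best_cost = k, Q, c
    return make_phi(best_k), best_Q
```

The l0-style term here charges the map for every distinct state it induces, so the search prefers economical state representations, mirroring the abstract's emphasis; richer map classes and a less greedy acceptance rule would be needed to reproduce the paper's results.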
Keywords
feature reinforcement learning, temporal difference learning, Markov decision process, partial observability, Q-learning, Monte Carlo search, Pocman, rational agents
Type
Conference paper
Book Title
JMLR Workshop and Conference Proceedings: Volume 29: Asian Conference on Machine Learning
Access Statement
Open Access
License Rights
Creative Commons Attribution licence