
Q-learning for history-based reinforcement learning

Daswani, Mayank; Sunehag, Peter; Hutter, Marcus

dc.contributor.author: Daswani, Mayank
dc.contributor.author: Sunehag, Peter
dc.contributor.author: Hutter, Marcus
dc.date.accessioned: 2015-08-13T04:36:45Z
dc.date.available: 2015-08-13T04:36:45Z
dc.identifier.issn: 1532-4435
dc.identifier.uri: http://hdl.handle.net/1885/14711
dc.description.abstract: We extend the Q-learning algorithm from the Markov Decision Process setting to problems where observations are non-Markov and do not reveal the full state of the world, i.e. to POMDPs. We do this in a natural manner by adding l0 regularisation to the pathwise squared Q-learning objective function and then optimising this over both a choice of map from history to states and the resulting MDP parameters. The optimisation procedure involves a stochastic search over the map class nested with classical Q-learning of the parameters. This algorithm fits perfectly into the feature reinforcement learning framework, which chooses maps based on a cost criterion. The cost criterion used so far for feature reinforcement learning has been model-based and aimed at predicting future states and rewards. Instead, we directly predict the return, which is what is needed for choosing optimal actions. Our Q-learning criterion also lends itself immediately to a function approximation setting where features are chosen based on the history. This algorithm is somewhat similar to the recent line of work on lasso temporal difference learning, which aims at finding a small feature set with which one can perform policy evaluation. The distinction is that we aim directly at learning the Q-function of the optimal policy, and we use l0 instead of l1 regularisation. We perform an experimental evaluation on classical benchmark domains and find improvements in convergence speed as well as in economy of the state representation. We also compare against MC-AIXI on the large Pocman domain and achieve competitive performance in average reward, while using less than half the CPU time and 36 times less memory. Overall, our algorithm hQL provides a better combination of computational, memory and data efficiency than existing algorithms in this setting.
dc.publisher: Journal of Machine Learning Research
dc.relation.ispartof: JMLR Workshop and Conference Proceedings: Volume 29: Asian Conference on Machine Learning
dc.rights: © 2013 M. Daswani, P. Sunehag & M. Hutter. Author can archive publisher’s version/PDF. http://www.sherpa.ac.uk/romeo/issn/1532-4435/ as at 13/8/15
dc.subject: feature reinforcement learning
dc.subject: temporal difference learning
dc.subject: Markov decision process
dc.subject: partial observability
dc.subject: Q-learning
dc.subject: Monte Carlo search
dc.subject: Pocman
dc.subject: rational agents
dc.title: Q-learning for history-based reinforcement learning
dc.type: Conference paper
local.identifier.citationvolume: 29
dc.date.issued: 2013-11
local.publisher.url: http://www.jmlr.org/
local.type.status: Published Version
local.contributor.affiliation: Daswani, M., Research School of Computer Science, The Australian National University
local.contributor.affiliation: Sunehag, P., Research School of Computer Science, The Australian National University
local.contributor.affiliation: Hutter, M., Research School of Computer Science, The Australian National University
dc.relation: http://purl.org/au-research/grants/arc/DP120100950
local.bibliographicCitation.startpage: 213
local.bibliographicCitation.lastpage: 228
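
The abstract above describes a nested procedure: an outer stochastic search over candidate maps from histories to states, with classical Q-learning run on the MDP each map induces, and candidates scored by a pathwise squared Q-learning cost plus an l0 regulariser. The Python fragment below is a minimal sketch of that structure only, not the authors' hQL implementation; the transition format, the names phi and neighbour, the simulated-annealing-style acceptance rule, and the use of a state-count term as the l0-style penalty are all illustrative assumptions.

# Minimal sketch under the assumptions stated above; not the authors' code.
import math
import random
from collections import defaultdict

def q_learning(transitions, n_actions, alpha=0.1, gamma=0.99, sweeps=20):
    """Classical tabular Q-learning over (state, action, reward, next_state) tuples."""
    Q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next in transitions:
            target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

def cost(phi, data, n_actions, lam=1.0, gamma=0.99):
    """Score a map phi: pathwise squared Bellman error of the learned Q-values
    plus an l0-style penalty (here, a count of the states phi actually uses)."""
    # data: list of (history, action, reward, next_history) tuples
    transitions = [(phi(h), a, r, phi(h2)) for h, a, r, h2 in data]
    Q = q_learning(transitions, n_actions, gamma=gamma)
    err = sum((r + gamma * max(Q[(s2, b)] for b in range(n_actions)) - Q[(s, a)]) ** 2
              for s, a, r, s2 in transitions)
    n_states = len({s for s, _, _, _ in transitions})
    return err + lam * n_states

def stochastic_search(data, n_actions, phi0, neighbour, n_iters=100, temp=1.0):
    """Outer loop: stochastic local search over the map class, nested with
    Q-learning of the parameters inside cost()."""
    phi, cur = phi0, cost(phi0, data, n_actions)
    for _ in range(n_iters):
        cand = neighbour(phi)                 # propose a nearby map (user-supplied)
        c = cost(cand, data, n_actions)
        if c < cur or random.random() < math.exp((cur - c) / temp):
            phi, cur = cand, c                # accept improving (or occasionally worse) maps
    return phi
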
Collections: ANU Research Publications

Download

File: Daswani et al QLearning for history based 2013.pdf (299.15 kB, Adobe PDF)


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.
