Daswani, Mayank; Sunehag, Peter; Hutter, Marcus
We extend the Q-learning algorithm from the Markov Decision Process
setting to problems where observations are non-Markov and do not
reveal the full state of the world, i.e. to POMDPs. We do this in a
natural manner by adding ℓ0 regularisation to the pathwise squared
Q-learning objective function and then optimising this over both a
choice of map from history to states and the resulting MDP