Stochastic optimization of controlled partially observable Markov decision processes
Authors
Bartlett, P. L.
Baxter, J.
Abstract
We introduce an on-line algorithm for finding local maxima of the average reward in a Partially Observable Markov Decision Process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ [0,1], which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces. We prove almost-sure convergence of our algorithm, and show how the correct setting of β is related to the mixing time of the Markov chain induced by the POMDP.
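The abstract's algorithm can be illustrated with a minimal sketch: a single sample path drives an eligibility trace discounted by β (the bias-variance knob), whose running product with the observed rewards estimates the gradient of the average reward with respect to the policy parameters. The two-state environment, softmax policy, and dynamics below are hypothetical illustrations, not from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def policy_gradient_estimate(theta, beta, T, rng):
    """Single-sample-path gradient estimate in the spirit of the
    paper's on-line algorithm (a hedged sketch; the 2-state POMDP
    below is a hypothetical stand-in).

    z:     eligibility trace, discounted by beta in [0, 1].
    delta: running average of reward * trace, the gradient estimate.
    """
    n_obs, n_act = theta.shape
    z = np.zeros_like(theta)
    delta = np.zeros_like(theta)
    state = 0
    for t in range(T):
        # Noisy observation of the hidden state; the agent never
        # sees `state` directly (no knowledge of underlying state).
        obs = state if rng.random() < 0.8 else 1 - state
        probs = softmax(theta[obs])
        action = rng.choice(n_act, p=probs)
        # Gradient of log pi(action | obs; theta) for a softmax policy.
        grad_log = np.zeros_like(theta)
        grad_log[obs] = -probs
        grad_log[obs, action] += 1.0
        # Hypothetical hidden-state dynamics and reward.
        state = action
        reward = 1.0 if state == 0 else 0.0
        # Core updates: discounted trace, then running average.
        z = beta * z + grad_log
        delta += (reward * z - delta) / (t + 1)
    return delta

rng = np.random.default_rng(0)
theta = np.zeros((2, 2))  # uniform initial policy
grad = policy_gradient_estimate(theta, beta=0.9, T=5000, rng=rng)
```

Smaller β shortens the trace's memory, lowering variance at the cost of bias; the paper relates the right setting of β to the mixing time of the induced Markov chain.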
Source
Proceedings of the IEEE Conference on Decision and Control