Stochastic optimization of controlled partially observable Markov decision processes

Authors

Bartlett, P. L.
Baxter, J.

Abstract

We introduce an on-line algorithm for finding local maxima of the average reward in a Partially Observable Markov Decision Process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ [0,1], which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces. We prove almost-sure convergence of our algorithm, and show how the correct setting of β is related to the mixing time of the Markov chain induced by the POMDP.
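The estimator the abstract describes can be sketched concretely. The following is a hedged, minimal reconstruction in the spirit of the authors' single-sample-path approach, not the paper's exact algorithm: a discounted eligibility trace of policy score functions is maintained along one trajectory, with β trading bias (small β) against variance (β near 1), and no knowledge of the underlying state is used. The environment here is a hypothetical two-armed bandit (a degenerate POMDP with a single hidden state and trivial observation), and the policy is a softmax over action parameters; both are illustrative choices, not from the paper.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def gradient_estimate(theta, mean_rewards, beta=0.8, T=20000, seed=0):
    """Single-sample-path estimate of the average-reward gradient.

    z is an eligibility trace of score functions grad log mu(a | theta),
    discounted by beta in [0, 1]; delta is the running average of r_t * z_t,
    which converges to a (beta-biased) estimate of the gradient of the
    average reward with respect to the policy parameters theta.
    """
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)      # eligibility trace
    delta = np.zeros_like(theta)  # running gradient estimate
    for t in range(T):
        p = softmax(theta)
        a = rng.choice(len(theta), p=p)
        score = -p
        score[a] += 1.0                      # grad_theta log mu(a | theta)
        z = beta * z + score                 # discounted score trace
        r = mean_rewards[a]                  # reward from the chosen arm
        delta += (r * z - delta) / (t + 1)   # running average of r_t * z_t
    return delta

theta = np.zeros(2)                          # uniform initial policy
g = gradient_estimate(theta, np.array([1.0, 0.0]))
# g points toward the higher-reward arm (g[0] > 0, g[1] < 0)
```

Increasing β lengthens the credit-assignment window of the trace, which is why the correct setting of β is tied to the mixing time of the Markov chain induced by the POMDP, as the abstract notes.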

Source

Proceedings of the IEEE Conference on Decision and Control

Book Title

Entity type

Publication

Access Statement

License Rights

DOI

Restricted until