Stochastic Optimisation of Controlled Partially Observable Markov Decision Processes
We introduce an on-line algorithm for finding local maxima of the average reward in a Partially Observable Markov Decision Process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ [0,1], which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In...
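The abstract does not state the update rule, so the following is only a minimal sketch of what an on-line, single-sample-path policy-gradient update of this general kind might look like. It assumes an eligibility trace discounted by β and a two-action logistic policy that sees only observations, never the underlying state. All function names, the toy environment, and the step size are hypothetical choices made for illustration, not the authors' algorithm.

```python
import math
import random


def action_probs(theta, observation):
    """Two-action policy: theta[observation] is the logit of choosing action 1."""
    p1 = 1.0 / (1.0 + math.exp(-theta[observation]))
    return [1.0 - p1, p1]


def grad_log_policy(theta, observation, action):
    """Gradient of log pi(action | observation) with respect to theta."""
    p1 = action_probs(theta, observation)[1]
    grad = [0.0] * len(theta)
    grad[observation] = (1.0 - p1) if action == 1 else -p1
    return grad


def trace_update(trace, grad, beta):
    """Discounted eligibility trace (assumed form): z <- beta * z + grad."""
    return [beta * z + g for z, g in zip(trace, grad)]


def parameter_update(theta, trace, reward, step_size):
    """Stochastic ascent step (assumed form): theta <- theta + step * r * z."""
    return [t + step_size * reward * z for t, z in zip(theta, trace)]


def run(steps, beta, step_size, seed=0):
    """Follow one sample path of a hypothetical toy problem.

    The observation is a fair coin flip and the reward is 1 when the
    action matches the observation, 0 otherwise. Only observations and
    rewards are used; no underlying state is consulted.
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    trace = [0.0, 0.0]
    for _ in range(steps):
        observation = rng.randrange(2)
        p1 = action_probs(theta, observation)[1]
        action = 1 if rng.random() < p1 else 0
        reward = 1.0 if action == observation else 0.0
        trace = trace_update(trace, grad_log_policy(theta, observation, action), beta)
        theta = parameter_update(theta, trace, reward, step_size)
    return theta


if __name__ == "__main__":
    print(run(steps=2000, beta=0.5, step_size=0.05))
```

In this sketch β controls how long past actions keep receiving credit for later rewards: a small β shortens the trace, which typically lowers variance at the cost of bias, while a β near 1 does the opposite. That is one common reading of the bias-variance trade-off the abstract mentions.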
Collections: ANU Research Publications
Source: 39th IEEE Conference on Decision and Control