
Experiments with Infinite-Horizon, Policy-Gradient Estimation

Baxter, Jon; Bartlett, Peter; Weaver, L


In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires …
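The GPOMDP estimator referenced above maintains an eligibility trace z that accumulates discounted log-policy gradients, z_{t+1} = β z_t + ∇_θ log μ(u_t | y_t; θ), and a running average Δ of r_{t+1} z_{t+1}, which converges to a biased estimate of the performance gradient. A minimal sketch of that update, on a hypothetical two-state toy chain (the environment, policy parameterisation, and function name are illustrative assumptions, not taken from the paper):

```python
import math
import random

def gpomdp(theta, beta=0.9, T=10000, seed=0):
    """Sketch of the GPOMDP gradient estimate (Baxter & Bartlett, 2001).

    Toy chain (an illustrative assumption): two states {0, 1}; taking
    action a moves deterministically to state a, and the reward is the
    state index. The policy is a logistic over the two actions with a
    single parameter theta, so P(action = 1) = sigmoid(theta).
    """
    rng = random.Random(seed)
    z = 0.0      # eligibility trace: discounted sum of log-policy gradients
    delta = 0.0  # running average of r * z, the biased gradient estimate
    for t in range(T):
        p1 = 1.0 / (1.0 + math.exp(-theta))   # P(action = 1)
        a = 1 if rng.random() < p1 else 0
        # d/dtheta log mu(a | theta) for the logistic policy
        glog = (1.0 - p1) if a == 1 else -p1
        z = beta * z + glog                   # discount trace by beta
        r = float(a)                          # next state = a; reward = state
        delta += (r * z - delta) / (t + 1)    # incremental running mean
    return delta
```

At theta = 0 the estimate is positive, pushing the policy toward the rewarding state; β trades bias (small β forgets delayed credit) against variance (β near 1 lets the trace grow noisy), which is the trade-off the abstract mentions.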

Collections: ANU Research Publications
Date published: 2001
Type: Journal article
Source: Journal of Artificial Intelligence Research


File: 01_Baxter_Experiments_with_2001.pdf
Size: 246.58 kB
Format: Adobe PDF

Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.

Updated: 20 July 2017 / Responsible Officer: University Librarian / Page Contact: Library Systems & Web Coordinator