Experiments with Infinite-Horizon, Policy-Gradient Estimation
Baxter, Jon; Bartlett, Peter; Weaver, L
Description
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires...
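For orientation, the kind of estimator the description refers to can be sketched in a few lines. This is a minimal illustration and not the authors' code. It assumes the GPOMDP update as it is commonly stated for the companion paper: an eligibility trace `z` discounted by β, plus a running average `delta` of reward times trace. The two-state environment `toy_step`, the softmax policy, and every function name are hypothetical choices made only for this sketch, and the observation is taken to equal the state to keep the example small.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_step(state, action, rng):
    """Hypothetical 2-state world: the action names the intended next state
    and succeeds with probability 0.9; landing in state 1 pays reward 1."""
    next_state = action if rng.random() < 0.9 else 1 - action
    return next_state, float(next_state == 1)


def gpomdp_estimate(theta, beta, num_steps, seed=0):
    """Estimate the average-reward gradient w.r.t. theta[obs][action].

    Assumed update (a sketch of GPOMDP, not taken from this record):
        z     <- beta * z + grad log pi(action | obs)
        delta <- delta + (reward * z - delta) / (t + 1)
    """
    rng = random.Random(seed)
    n_obs, n_act = len(theta), len(theta[0])
    z = [[0.0] * n_act for _ in range(n_obs)]      # eligibility trace
    delta = [[0.0] * n_act for _ in range(n_obs)]  # running gradient estimate
    state = 0
    for t in range(num_steps):
        obs = state  # assumption: fully observed, for brevity
        probs = softmax(theta[obs])
        action = rng.choices(range(n_act), weights=probs)[0]
        state, reward = toy_step(state, action, rng)
        for y in range(n_obs):
            for a in range(n_act):
                # Score of a softmax policy: indicator minus probability,
                # nonzero only in the row of the current observation.
                if y == obs:
                    score = (1.0 if a == action else 0.0) - probs[a]
                else:
                    score = 0.0
                z[y][a] = beta * z[y][a] + score
                delta[y][a] += (reward * z[y][a] - delta[y][a]) / (t + 1)
    return delta


grad = gpomdp_estimate([[0.0, 0.0], [0.0, 0.0]], beta=0.5, num_steps=20000)
```

In this toy setting the estimate should favour action 1 under both observations, since that action leads to the rewarding state. Raising `beta` towards 1 is expected to reduce bias at the cost of higher variance, which is the trade-off the description mentions.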
Field | Value | |
---|---|---|
dc.contributor.author | Baxter, Jon | |
dc.contributor.author | Bartlett, Peter | |
dc.contributor.author | Weaver, L | |
dc.date.accessioned | 2015-12-10T23:11:57Z | |
dc.identifier.issn | 1076-9757 | |
dc.identifier.uri | http://hdl.handle.net/1885/63908 | |
dc.description.abstract | In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems. | |
dc.publisher | Morgan Kaufmann Publishers | |
dc.source | Journal of Artificial Intelligence Research | |
dc.title | Experiments with Infinite-Horizon, Policy-Gradient Estimation | |
dc.type | Journal article | |
local.description.notes | Imported from ARIES | |
local.description.refereed | Yes | |
local.identifier.citationvolume | 15 | |
dc.date.issued | 2001 | |
local.identifier.absfor | 010303 - Optimisation | |
local.identifier.ariespublication | MigratedxPub862 | |
local.type.status | Published Version | |
local.contributor.affiliation | Baxter, Jon, College of Engineering and Computer Science, ANU | |
local.contributor.affiliation | Bartlett, Peter, College of Engineering and Computer Science, ANU | |
local.contributor.affiliation | Weaver, L, College of Engineering and Computer Science, ANU | |
local.description.embargo | 2037-12-31 | |
local.bibliographicCitation.startpage | 351 | |
local.bibliographicCitation.lastpage | 381 | |
dc.date.updated | 2015-12-10T09:25:21Z | |
local.identifier.scopusID | 2-s2.0-0013495368 | |
Collections | ANU Research Publications |
Download
File | Description | Size | Format | Image |
---|---|---|---|---|
01_Baxter_Experiments_with_2001.pdf | | 246.58 kB | Adobe PDF | Request a copy |