Experiments with Infinite-Horizon, Policy-Gradient Estimation

Baxter, Jon; Bartlett, Peter; Weaver, L

Description

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
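
As a rough illustration of the idea summarised in the abstract, the sketch below shows a GPOMDP-style estimate on a toy problem, followed by plain stochastic gradient ascent. It is not taken from the paper's experiments. The setting is an assumption made for brevity: a single observation, a softmax policy over two actions, and a reward that depends only on the action. The names `softmax`, `gpomdp_estimate` and `reward_fn` are hypothetical. The estimate keeps an eligibility trace discounted by β and a running average of reward times trace.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of parameters."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def gpomdp_estimate(theta, reward_fn, beta, steps, rng):
    """Running GPOMDP-style gradient estimate for a softmax policy.

    Assumed toy setting (not from the paper): one observation, so the
    policy is softmax(theta) over actions and reward_fn(action) is the reward.
    """
    n = len(theta)
    z = [0.0] * n       # eligibility trace, discounted by beta
    delta = [0.0] * n   # running average of reward * trace
    for t in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(n), weights=probs)[0]
        reward = reward_fn(action)
        for i in range(n):
            # Derivative of log softmax(theta)[action] with respect to theta[i].
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            z[i] = beta * z[i] + grad_log
            delta[i] += (reward * z[i] - delta[i]) / (t + 1)
    return delta


def reward_fn(action):
    # Hypothetical reward: action 0 pays 1, action 1 pays 0.
    return 1.0 if action == 0 else 0.0


rng = random.Random(0)

# Single estimate at theta = [0, 0]; beta = 0 suffices here because
# successive steps in this toy problem are independent.
estimate = gpomdp_estimate([0.0, 0.0], reward_fn, beta=0.0, steps=20000, rng=rng)

# Plain stochastic gradient ascent using fresh estimates at each iterate.
theta = [0.0, 0.0]
for _ in range(50):
    grad = gpomdp_estimate(theta, reward_fn, beta=0.0, steps=2000, rng=rng)
    theta = [th + 1.0 * g for th, g in zip(theta, grad)]
```

In this toy problem the average reward equals the probability of choosing action 0, so the estimate should point towards increasing `theta[0]`, and the ascent loop should shift the policy towards that action. In problems with temporal structure, larger values of β would be expected to reduce bias at the cost of higher variance, which is the trade-off the abstract refers to.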

dc.contributor.author	Baxter, Jon
dc.contributor.author	Bartlett, Peter
dc.contributor.author	Weaver, L
dc.date.accessioned	2015-12-10T23:11:57Z
dc.identifier.issn	1076-9757
dc.identifier.uri	http://hdl.handle.net/1885/63908
dc.description.abstract	In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
dc.publisher	Morgan Kaufmann Publishers
dc.source	Journal of Artificial Intelligence Research
dc.title	Experiments with Infinite-Horizon, Policy-Gradient Estimation
dc.type	Journal article
local.description.notes	Imported from ARIES
local.description.refereed	Yes
local.identifier.citationvolume	15
dc.date.issued	2001
local.identifier.absfor	010303 - Optimisation
local.identifier.ariespublication	MigratedxPub862
local.type.status	Published Version
local.contributor.affiliation	Baxter, Jon, College of Engineering and Computer Science, ANU
local.contributor.affiliation	Bartlett, Peter, College of Engineering and Computer Science, ANU
local.contributor.affiliation	Weaver, L, College of Engineering and Computer Science, ANU
local.description.embargo	2037-12-31
local.bibliographicCitation.startpage	351
local.bibliographicCitation.lastpage	381
dc.date.updated	2015-12-10T09:25:21Z
local.identifier.scopusID	2-s2.0-0013495368
Collections	ANU Research Publications

Download

File	Description	Size	Format	Image
01_Baxter_Experiments_with_2001.pdf		246.58 kB	Adobe PDF	Request a copy
