Experiments with Infinite-Horizon, Policy-Gradient Estimation

Baxter, Jon; Bartlett, Peter; Weaver, L

dc.contributor.author	Baxter, Jon
dc.contributor.author	Bartlett, Peter
dc.contributor.author	Weaver, L
dc.date.accessioned	2015-12-10T23:11:57Z
dc.date.issued	2001
dc.date.updated	2015-12-10T09:25:21Z
dc.description.abstract	In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
dc.identifier.issn	1076-9757
dc.identifier.uri	http://hdl.handle.net/1885/63908
dc.publisher	Morgan Kaufmann Publishers
dc.source	Journal of Artificial Intelligence Research
dc.title	Experiments with Infinite-Horizon, Policy-Gradient Estimation
dc.type	Journal article
local.bibliographicCitation.lastpage	381
local.bibliographicCitation.startpage	351
local.contributor.affiliation	Baxter, Jon, College of Engineering and Computer Science, ANU
local.contributor.affiliation	Bartlett, Peter, College of Engineering and Computer Science, ANU
local.contributor.affiliation	Weaver, L, College of Engineering and Computer Science, ANU
local.contributor.authoruid	Baxter, Jon, u9612464
local.contributor.authoruid	Bartlett, Peter, u9301805
local.contributor.authoruid	Weaver, L, u9405743
local.description.embargo	2037-12-31
local.description.notes	Imported from ARIES
local.description.refereed	Yes
local.identifier.absfor	010303 - Optimisation
local.identifier.ariespublication	MigratedxPub862
local.identifier.citationvolume	15
local.identifier.scopusID	2-s2.0-0013495368
local.type.status	Published Version

Downloads

Original bundle

Name: 01_Baxter_Experiments_with_2001.pdf
Size: 246.58 KB
Format: Adobe Portable Document Format

Collections

ANU Research Publications
