Experiments with Infinite-Horizon, Policy-Gradient Estimation

Baxter, Jon; Bartlett, Peter; Weaver, L

Date

2001

Authors

Baxter, Jon

Bartlett, Peter

Weaver, L

Publisher

Morgan Kaufmann Publishers

Abstract

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
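
As a rough illustration of the kind of estimate the abstract describes, the following is a minimal sketch of a GPOMDP-style update: an eligibility trace discounted by β accumulates the gradient of the log-probability of each chosen control, and a running average of reward times trace forms the gradient estimate. The softmax policy, the function names (`softmax`, `gpomdp_estimate`, `toy_step`) and the toy environment are assumptions made for this example and are not taken from the paper. The toy problem is memoryless with no observations, so it does not exercise the partially observable setting.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of parameters."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def gpomdp_estimate(theta, step, beta, num_steps, rng):
    """Running GPOMDP-style gradient estimate for a softmax policy.

    `step(action, rng)` is a hypothetical environment callback that
    returns the reward for the chosen action (an assumption of this sketch).
    """
    z = [0.0] * len(theta)      # eligibility trace, discounted by beta
    delta = [0.0] * len(theta)  # running average of reward * trace
    for t in range(num_steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = step(action, rng)
        for i in range(len(theta)):
            # Gradient of log softmax: indicator(action) - probability.
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            z[i] = beta * z[i] + grad_log
            delta[i] += (reward * z[i] - delta[i]) / (t + 1)
    return delta


def toy_step(action, rng):
    """Hypothetical memoryless problem: action 1 pays 1, action 0 pays 0."""
    return 1.0 if action == 1 else 0.0


estimate = gpomdp_estimate(
    [0.0, 0.0], toy_step, beta=0.0, num_steps=20000, rng=random.Random(0)
)
print(estimate)
```

In this toy problem rewards do not depend on past actions, so β = 0 already gives an unbiased estimate. In problems with longer-range dependencies, raising β toward 1 would be expected to reduce bias at the cost of higher variance, which is the trade-off the abstract refers to.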

URI

http://hdl.handle.net/1885/63908

Collections

ANU Research Publications

Source

Journal of Artificial Intelligence Research

Type

Journal article

Restricted until

2037-12-31

Downloads

File

Description

01_Baxter_Experiments_with_2001.pdf (246.58 KB)
