Policy Gradient Methods: Variance Reduction and Stochastic Convergence
Abstract
In a reinforcement learning task an agent must learn a policy for performing actions
so as to perform well in a given environment. Policy gradient methods consider a
parameterized class of policies, and using a policy from the class, and a trajectory
through the environment taken by the agent using this policy, estimate the performance
of the policy with respect to the parameters. Policy gradient methods avoid
some of the problems of value function methods, such as policy degradation, where
inaccuracy in the value function leads to the choice of a poor policy. However, the
estimates produced by policy gradient methods can have high variance. ¶ ...