Feature reinforcement learning agents

Authors

Nguyen, Phuong Minh

Abstract

Reinforcement Learning (RL) is currently an active research area of Artificial Intelligence (AI) in which an agent interacts with an unknown environment in order to collect as much reward as possible. One of the most challenging problems in AI is the General Reinforcement Learning (GRL) problem, where the agent has no knowledge about the environment except access to observations and rewards after taking actions. A recently proposed framework called Feature Markov Decision Process (ΦMDP), or Feature Reinforcement Learning (FRL), offers a sound and generic approach to the GRL problem. The central idea of FRL is to find a good state representation function, which maps histories to the states of some (approximate) Markov Decision Process (MDP). This thesis further develops the FRL (ΦMDP) agent model and makes the following contributions.

First, we examine the ΦMDP cost functions through a careful empirical and theoretical investigation on a variety of toy examples. Most notably, we detect some issues with the cost functions proposed in the original ΦMDP framework.

Second, we develop a practical ΦMDP agent that uses stochastic search (parallel tempering) over a space of feature maps (state representation functions, or models for short) to minimize a ΦMDP cost criterion, and we provide a thorough comparative empirical analysis of the proposed algorithm on the class of context trees.

Third, we develop a new GRL algorithm, called Context Tree Maximizing Reinforcement Learning (CTMRL), which analytically computes the optimal context tree in linear time, overcoming shortcomings inherent in the stochastic search approach above.

Beyond empirical GRL, we are also interested in the theoretical analysis of GRL, especially finite-time analysis, where the primary goal is to design strategies that, in finite time, play as well as an agent that knows the underlying environment's states, model dynamics, and optimal policy. Our fourth main contribution is an algorithm named IBLB, an extension of the Best Lower Bound (BLB) algorithm, which offers an efficient strategy for selecting good state representations from a given set of models. BLB and IBLB are finite-time developments of the FRL framework under the average-reward scheme, where data efficiency is the primary goal. BLB achieves a regret of order T^{2/3}, with T being the time horizon; however, its regret bound contains an additive term that is exponential in the diameter of the Markov model against which its performance is assessed. Our proposed algorithm IBLB, on the other hand, can handle a set of infinitely many models and still achieves a regret of order T^{2/3}, but with all constants reasonably small.

Finally, we propose three further algorithms, named OMS, ROMS and IOMS, which all attain the optimal regret order of T^{1/2}. These algorithms address a similar GRL setting to BLB and IBLB, but they are developed using the principle of optimism under uncertainty instead of the BLB strategy.
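
To make the central idea concrete, here is a minimal illustrative sketch (not the thesis's implementation) of a state representation function: a context-tree feature map that reduces a history of binary observations to the leaf context it matches, which then serves as the state of an approximate MDP. The names phi and CONTEXTS, and the particular tree, are hypothetical.

    # Minimal sketch of the Feature RL idea: a feature map "phi" reduces the
    # agent's raw history to the state of a small, approximate MDP. Here phi
    # is a context-tree map: the state is the suffix of the observation
    # history matching a leaf of a fixed context tree. Contexts are tuples
    # with the oldest symbol first and the most recent symbol last.
    # The tree below is an arbitrary illustration, not taken from the thesis.
    CONTEXTS = {("0",), ("0", "1"), ("1", "1")}

    def phi(history):
        """Map a history of '0'/'1' observations to a context-tree state."""
        for depth in range(1, len(history) + 1):
            suffix = tuple(history[-depth:])
            if suffix in CONTEXTS:
                return suffix
        return ()  # history too short to match any leaf

    print(phi(["1", "0", "1", "1"]))  # ('1', '1') is the current MDP state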
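The stochastic-search agent of the second contribution can be pictured as follows. This is a hedged sketch under stated assumptions: toy_cost stands in for the actual ΦMDP cost criterion (which codes the induced state and reward sequences), and the split/merge neighbour move is one plausible local move; neither is taken from the thesis.

    # Parallel tempering over context trees: several Metropolis chains run at
    # different temperatures and occasionally swap, so hot chains explore the
    # model space broadly while cold chains refine the best trees found.
    import math
    import random

    def split(tree, leaf):
        """Replace a leaf by its two children (one more symbol of history)."""
        return frozenset(tree - {leaf} | {("0",) + leaf, ("1",) + leaf})

    def merge(tree, parent):
        """Replace a sibling pair of leaves by their parent context."""
        return frozenset(tree - {("0",) + parent, ("1",) + parent} | {parent})

    def neighbor(tree):
        """Random local move: merge a sibling pair or split a leaf."""
        parents = {c[1:] for c in tree
                   if len(c) >= 2 and ("0",) + c[1:] in tree and ("1",) + c[1:] in tree}
        if parents and random.random() < 0.5:
            return merge(tree, random.choice(sorted(parents)))
        return split(tree, random.choice(sorted(tree)))

    def toy_cost(tree):
        """Stand-in for the ΦMDP cost: mismatch to a 'true' tree plus a size penalty."""
        target = frozenset({("0",), ("0", "1"), ("1", "1")})
        return len(tree ^ target) + 0.1 * len(tree)

    def parallel_tempering(cost, init, temps=(0.1, 0.5, 1.0, 2.0), steps=2000):
        chains = [init] * len(temps)
        best, best_cost = init, cost(init)
        for _ in range(steps):
            for i, temp in enumerate(temps):
                cand = neighbor(chains[i])
                delta = cost(cand) - cost(chains[i])
                if delta <= 0 or random.random() < math.exp(-delta / temp):
                    chains[i] = cand
                if cost(chains[i]) < best_cost:
                    best, best_cost = chains[i], cost(chains[i])
            # swap move between adjacent temperatures (standard tempering rule)
            i = random.randrange(len(temps) - 1)
            log_acc = (1 / temps[i] - 1 / temps[i + 1]) * (cost(chains[i]) - cost(chains[i + 1]))
            if log_acc >= 0 or random.random() < math.exp(log_acc):
                chains[i], chains[i + 1] = chains[i + 1], chains[i]
        return best

    print(parallel_tempering(toy_cost, frozenset({("0",), ("1",)})))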
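For the third contribution, the flavour of the linear-time computation can be conveyed by a toy rendition of context-tree maximizing over binary data: score the counts following each context with the Krichevsky-Trofimov (KT) estimator, then make one bottom-up pass that decides leaf-versus-split at every node, so the optimal tree falls out in time linear in the number of nodes. This sketches only the underlying CTM idea; CTMRL's actual criterion and its reinforcement learning setting differ, and the LOG_HALF structure prior and demo data are illustrative.

    # Toy context-tree maximizing (CTM): counts of the symbol following each
    # context are scored with the KT estimator; a bottom-up pass then decides
    # leaf-vs-split at every node, yielding the optimal tree in one sweep.
    import math

    LOG_HALF = math.log(0.5)  # per-node structure prior (one bit of tree shape)

    def log_kt(zeros, ones):
        """Log Krichevsky-Trofimov probability of a sequence with these counts."""
        lp, z, o = 0.0, 0, 0
        for _ in range(zeros):
            lp += math.log((z + 0.5) / (z + o + 1)); z += 1
        for _ in range(ones):
            lp += math.log((o + 0.5) / (z + o + 1)); o += 1
        return lp

    def count_contexts(bits, max_depth):
        """counts[ctx] = (zeros, ones) following context ctx in the data."""
        counts = {}
        for t in range(max_depth, len(bits)):
            for d in range(max_depth + 1):
                ctx = tuple(bits[t - d:t])
                z, o = counts.get(ctx, (0, 0))
                counts[ctx] = (z + 1, o) if bits[t] == "0" else (z, o + 1)
        return counts

    def best_tree(ctx, counts, max_depth):
        """Return (log score, leaf set) of the best subtree rooted at ctx."""
        z, o = counts.get(ctx, (0, 0))
        if len(ctx) == max_depth:
            return log_kt(z, o), {ctx}
        s0, t0 = best_tree(("0",) + ctx, counts, max_depth)
        s1, t1 = best_tree(("1",) + ctx, counts, max_depth)
        leaf_score, split_score = LOG_HALF + log_kt(z, o), LOG_HALF + s0 + s1
        if leaf_score >= split_score:
            return leaf_score, {ctx}
        return split_score, t0 | t1

    bits = "01" * 50  # data where the previous bit determines the next
    counts = count_contexts(bits, max_depth=3)
    print(sorted(best_tree((), counts, max_depth=3)[1]))  # [('0',), ('1',)]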
