Adaptive Recommendations with Bandit Feedback

Zhang, Mengyan

Adaptive Recommendations with Bandit Feedback

Date

2023

Authors

Zhang, Mengyan

Abstract

Sequential decision-making under uncertainty is a fundamental challenge in machine learning. We consider multi-armed bandits problems, where the goal is to design a policy to recommend arms (options) sequentially where in each round only the rewards sampled from the selected arms are available. In this thesis, we study both theoretical and practical aspects of sequential decision-making with bandit feedback. From the theory aspect, we summarise the rewards using quantiles for risk-averse decision-making and selecting a local area of arm space with only aggregated feedback for a new aggregated regret. From the practical side, we study the best arms identification for biological sequence design in real-world applications and focus on the regret minimisation setting for large-scale news recommendations. There are three design choices one can make in bandit tasks, namely arms, objectives and rewards. The theoretical and practical contributions of this thesis cover and enrich the three design choices. We first consider the design choice of arm optimality and propose to use quantiles to summarise the reward distributions. This is under the task of best arms identification with a fixed budget for risk-averse decision-making. With new concentration inequalities for both order statistics and quantiles, we propose Quantile Successive Accepts and Rejects (Q-SAR) algorithm and provide an upper bound of the probability of finding the suboptimal arms. By assuming there is a reward function over the arm (feature) space and making assumptions about the smoothness of the reward function, we consider a real-world application in synthetic biology, where we design biological sequences in batches to optimise the corresponding protein expression level. By proposing a machine learning guided Design-Build-Test-Learn cycle and utilising the Gaussian process batch upper confidence bound algorithm, we improve the protein expression level by up to 35%. Motivated by the large arm space in applications such as large-scale recommender systems and biological sequence design, we design algorithms to explore arms hierarchically. Under the assumption of Gaussian process rewards, we consider a novel setting where only the aggregated feedback of a local area can be observed, instead of the reward of each selected choice. This setting is also useful when the individual feedback is not available due to for example privacy or hardware constraints. We propose a new algorithm called Gaussian Process Optimistic Optimisation (GPOO) to address this new setting and provided an upper bound of the aggregated regret. We then consider another type of hierarchical design for the application of recommendation systems, where we propose a two-stage topic-item recommendation framework for news recommendation. We further propose policies to utilise the neural representations and conduct large-scale experiments on a news recommendation dataset, which verifies the efficiency of the two-stage framework and our proposed policies.