Adaptive Recommendations with Bandit Feedback
Date
2023
Authors
Zhang, Mengyan
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Sequential decision-making under uncertainty is a fundamental challenge in machine learning. We consider multi-armed bandits problems, where the goal is to design a policy to recommend arms (options) sequentially where in each round only the rewards sampled from the selected arms are available. In this thesis, we study both theoretical and practical aspects of sequential decision-making with bandit feedback. From the theory aspect, we summarise the rewards using quantiles for risk-averse decision-making and selecting a local area of arm space with only aggregated feedback for a new aggregated regret. From the practical side, we study the best arms identification for biological sequence design in real-world applications and focus on the regret minimisation setting for large-scale news recommendations. There are three design choices one can make in bandit tasks, namely arms, objectives and rewards. The theoretical and practical contributions of this thesis cover and enrich the three design choices. We first consider the design choice of arm optimality and propose to use quantiles to summarise the reward distributions. This is under the task of best arms identification with a fixed budget for risk-averse decision-making. With new concentration inequalities for both order statistics and quantiles, we propose Quantile Successive Accepts and Rejects (Q-SAR) algorithm and provide an upper bound of the probability of finding the suboptimal arms. By assuming there is a reward function over the arm (feature) space and making assumptions about the smoothness of the reward function, we consider a real-world application in synthetic biology, where we design biological sequences in batches to optimise the corresponding protein expression level. By proposing a machine learning guided Design-Build-Test-Learn cycle and utilising the Gaussian process batch upper confidence bound algorithm, we improve the protein expression level by up to 35%. Motivated by the large arm space in applications such as large-scale recommender systems and biological sequence design, we design algorithms to explore arms hierarchically. Under the assumption of Gaussian process rewards, we consider a novel setting where only the aggregated feedback of a local area can be observed, instead of the reward of each selected choice. This setting is also useful when the individual feedback is not available due to for example privacy or hardware constraints. We propose a new algorithm called Gaussian Process Optimistic Optimisation (GPOO) to address this new setting and provided an upper bound of the aggregated regret. We then consider another type of hierarchical design for the application of recommendation systems, where we propose a two-stage topic-item recommendation framework for news recommendation. We further propose policies to utilise the neural representations and conduct large-scale experiments on a news recommendation dataset, which verifies the efficiency of the two-stage framework and our proposed policies.
Description
Keywords
Citation
Collections
Source
Type
Thesis (PhD)
Book Title
Entity type
Access Statement
License Rights
Restricted until
Downloads
File
Description
Thesis Material