Learning for the Internet: kernel embeddings and optimisation

Date

2011

Authors

Quadrianto, Novi

Abstract

In this thesis, we develop principled machine learning methods suited to complex real-world Internet challenges. The Internet has supplied an unprecedented amount of data; the challenge now is to transform this massive amount of data into information that supports knowledge creation. Machine learning techniques have become prevalent for modelling, prediction, and decision making from massive-scale data. This thesis contributes to this data-to-knowledge transformation in the context of machine learning: we introduce non-standard machine learning problems and devise solutions for them, and we present scalable solutions for several existing machine learning problems. The present work focuses on addressing Internet complexity in the output (label) dimension.

The first part of this thesis deals with the formulation and solution of non-standard machine learning problems. Traditionally, supervised machine learning settings draw inferences and make predictions from a set of input objects, each of which is supervised by a desired output value. The Internet poses the challenges of weak label supervision and label inconsistency. We focus on the following three new settings:

1. Learning from only label proportions: a learning setting where, instead of each input being supervised with an output, we are given groups of unlabelled inputs, each endowed with information on its class label proportions. The number of groups is at least as large as the number of labels (Chapter 3).

2. Learning input-output correspondences: a learning setting where a set of inputs and a set of outputs are given, but they are not paired (Chapter 4).

3. Learning from several related tasks with distinct label sets: a learning setting where several related tasks are given, but each task has a potentially distinct label set and label correspondences are not readily available (Chapter 5).

The second part addresses refinements of existing machine learning models and algorithms to scale to large data. The contributions of this thesis include streaming algorithms for the following two problems:

1. Transductive learning: we present a scalable algorithm for learning from labelled and unlabelled data simultaneously by matching the output distributions on the labelled and unlabelled data (Chapter 6).

2. Storage and indexing management: we present a scalable algorithm for indexing in the context of webpage tiering. The goal is to allocate pages to caches such that the most frequently accessed pages reside in the caches with the smallest latency, whereas the least frequently retrieved pages are stored in the back tiers of the caching system. This indexing and storage problem belongs to a larger class of parametric flow problems (Chapter 7).

The solutions presented in this thesis are centred around two main mathematical ingredients. First, Hilbert space embeddings of distributions via their means are used. These allow distances between data distributions to be computed as distances between mean embeddings, which in turn yields elegant ways to deal with distributions without estimating them as an intermediate step. Second, recent advances in the field of optimisation are exploited to address the sheer size and the non-convex nature of most Internet problems.
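
The first ingredient amounts to comparing distributions through the distance between their kernel mean embeddings, commonly known as the maximum mean discrepancy. The following minimal sketch is an illustration of that idea only, not the thesis implementation; the RBF kernel, its bandwidth, and the sample data are assumptions chosen for the example.

    # Illustrative sketch: squared distance between the empirical kernel mean
    # embeddings of two samples (a biased empirical MMD^2) with an RBF kernel.
    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # k(a, b) = exp(-gamma * ||a - b||^2), evaluated for all pairs of rows.
        sq_dists = (
            np.sum(A**2, axis=1)[:, None]
            + np.sum(B**2, axis=1)[None, :]
            - 2.0 * A @ B.T
        )
        return np.exp(-gamma * sq_dists)

    def mmd2(X, Y, gamma=1.0):
        # ||mu_P - mu_Q||^2 = E k(x,x') - 2 E k(x,y) + E k(y,y'),
        # estimated by averaging the corresponding kernel matrices.
        Kxx = rbf_kernel(X, X, gamma)
        Kyy = rbf_kernel(Y, Y, gamma)
        Kxy = rbf_kernel(X, Y, gamma)
        return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(200, 5))   # sample from a distribution P
    Y = rng.normal(0.5, 1.0, size=(200, 5))   # sample from a shifted distribution Q
    print(mmd2(X, Y))                          # grows as P and Q move apart

No density estimation is needed at any point: the two samples are compared purely through averages of kernel evaluations, which is what makes the mean-embedding view attractive for the settings above.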

Type

Thesis (PhD)

Access Statement

Open Access
