Label-Efficient Video and Language Representation Learning and Applications
Abstract
Video and language research aims to model and analyse the two communication modalities and their connections. Learning effective video and language representations is pivotal in facilitating a wide spectrum of applications, such as content-based video retrieval, multimedia content generation, and video-based assistive technology.
Modern deep learning-based video-language models require large amounts of data for supervised training. However, obtaining accurately annotated video and language data is laborious and expensive, especially for tasks requiring domain expertise. Consequently, existing works usually report compromised results given limited access to annotations. To this end, this thesis devises label-efficient algorithms for video and language understanding, aiming to learn good video and language representations from only a few labels and/or weak labels. To demonstrate the practical importance of these techniques, we also extensively study their application to automated sign language understanding from video, where annotations are scarce due to the costly domain knowledge required. The main contributions of this thesis are summarised as follows.
First, we present a generic video and language pre-training framework (AlPro), which learns effective multimodal representations from video-text pairs. Instead of relying on fully annotated video-text pairs, we use pairs easily accessible from the web, reducing the demand for human labelling effort. Specifically, our method aims to capture the alignment between video and text inputs. This is achieved by contrastively aligning unimodal video-text features at the instance level, as well as enhancing the fine-grained alignment between visual regions and textual entities. When transferred to downstream tasks, such as video-text retrieval and video question answering, our pre-trained model surpasses previous methods by a significant margin, while using orders of magnitude less training data.
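As a concrete illustration of the instance-level contrastive alignment mentioned above, the following is a minimal, generic sketch of a symmetric video-text contrastive (InfoNCE-style) loss. It is not the AlPro implementation; the function name, embedding shapes, and temperature value are assumptions made for illustration only.

```python
# Generic sketch of instance-level video-text contrastive alignment
# (InfoNCE-style). Names and hyperparameters are illustrative assumptions,
# not the thesis's actual implementation.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    # Normalise so that dot products become cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)   # matching pairs lie on the diagonal
    # Symmetric cross-entropy: align video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

In this generic formulation, each video in a batch is pulled towards its own caption and pushed away from the other captions in the same batch, and vice versa.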
We then describe our efforts in developing techniques and resources for automated sign language understanding and generation, a representative video and language task where labels are expensive to acquire.
In particular, we study the problem of word-level sign language recognition from videos, aiming to classify gestures of sign language "words" in videos. Training recognition models for this task requires video samples with large variations in signer appearance; therefore, large-scale labelled datasets are scarce. To tackle this issue, we propose to utilise sign language news videos from public video-sharing platforms as an auxiliary data source with weak labels, leading to a self-training framework. Motivated by the observation that important visual concepts are shared across domains, we propose to learn domain-invariant visual descriptors that benefit recognition. Our method obtains significant improvements across multiple public datasets, including the largest Word-level American Sign Language recognition dataset (WLASL).
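To make the self-training idea above concrete, below is a schematic sketch of one pseudo-labelling round: a model trained on the labelled data assigns pseudo-labels to weakly labelled auxiliary videos, and only confident predictions are kept for subsequent training. The function names, confidence threshold, and training loop structure are assumptions for illustration, not the thesis implementation.

```python
# Schematic self-training sketch: pseudo-label auxiliary (weakly labelled) clips
# with the current model, keep confident predictions, and use them together with
# the labelled data in further supervised updates. Names are illustrative only.
import torch
import torch.nn.functional as F

def pseudo_label(model, unlabelled_clips, threshold: float = 0.9):
    """Return (clip, label) pairs that the current model predicts with high confidence."""
    model.eval()
    confident = []
    with torch.no_grad():
        for clip in unlabelled_clips:                     # clip: (T, C, H, W) tensor
            probs = torch.softmax(model(clip.unsqueeze(0)), dim=-1)[0]
            conf, label = probs.max(dim=-1)
            if conf.item() >= threshold:                  # keep only confident predictions
                confident.append((clip, label.item()))
    return confident

def train_step(model, optimizer, clips, labels):
    """One supervised update on a batch of (clip, label) pairs, labelled or pseudo-labelled."""
    model.train()
    logits = model(clips)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A self-training loop would alternate between training on the labelled set plus the current pseudo-labelled set and re-generating pseudo-labels with the improved model.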