Weakly Supervised Vision and Language Representation Learning in Sign Language Understanding
Abstract
Deep learning research has shaped many industry fields and plays a driving role in real-life applications. However, the demand for high-quality, large-volume data is a major obstacle to adoption in many areas. The impressive results obtained on typical benchmarks (e.g., ImageNet and WMT) require millions of annotated examples. Performance is often worse when a model is applied to a new environment, casting doubt on the practical usefulness of these models. This thesis gives evidence of how deep learning can be used in a weakly supervised setting, where the data is limited in either quantity or quality (data noise). Models that use data more efficiently can be used in a broader range of applications and give us more confidence in their capacity for generalization over memorization. The weakly supervised learning literature concerns three types of problem: incomplete supervision, inexact supervision, and noisy supervision, all of which are covered in this thesis. I focus on multi-modal representation learning and sign language modeling from video inputs. This multi-modal setting adds data dimensions, e.g., visual information from sign language video alongside text, while the existing datasets are limited in both quality and quantity. As a result, the work described in this thesis demonstrates how ideas from multi-modal modeling can be applied to a real weakly supervised learning context.
Chapter 3 shows the benefits of incorporating inductive priors that arise from the data modality, e.g., objects move smoothly in videos and larger objects tend to be closer to the viewer. I describe a model that leverages the intra- and inter-frame relations embedded in video data. Sign language translation is an ideal evaluation context because it demands a greater temporal understanding across all frames than the short clips used in other video tasks. The results show that this approach improves translation performance by a clear margin over the state of the art.
Chapter 4 shows how weakly supervised modeling can benefit from tasks where it is possible to synthesize data. Video deblurring is one such task, because plausible blur can be generated by modeling. The improved video quality from a deblurring model leads to more robust performance in downstream tasks, especially with limited data. In contrast to conventional methods that focus on local pixels, I show how a pyramid feature aggregation structure allows the model to find correspondences between distant pixels, giving superior performance on multiple video deblurring benchmarks.
Chapter 5 describes the user study I conducted on the practical usability of our sign language recognition methods, with the goal of understanding how people perceive the results from models trained in a weakly supervised setting. I present the first end-to-end sign language dictionary that detects sign words directly from webcam video capture. Participants include both experienced and junior sign language learners, and the results suggest that they favor the tool, validating the usefulness of our weakly supervised learning model.
Previous work on weakly supervised learning has primarily involved simulated tasks, e.g., restricting the model to use only a small portion of the available data. Although this makes it easier to examine a model's capacity to recover performance from limited resources, such tasks present challenges different from those that arise in real problems that naturally suffer from limited data, e.g., sign language. In contrast, this thesis is concerned with real weakly supervised learning problems for which no "high resource" data exists. I have explored approaches to improve model performance and verified their usefulness through a direct study with the target users. These contributions have been published at related conferences. My hope is to motivate more people to work on real-world weakly supervised problems, which often concern less-supported communities such as sign language users, and to make deep learning more human-centered.