Yang, Hongtao
Description
One of the key advantages of supervised deep learning over conventional machine learning is that the learning process of a neural network does not rely directly on hand-crafted features. There is no denying that the availability of large-scale annotated data plays a vital role in the success of deep learning, among many other factors.
On the other hand, given the capacity of deep neural networks, modern deep learning systems are able to approximate complex functions and fit complex distributions of data of all kinds. As a result, the generalisability of a trained deep neural network is closely tied to the quality of its training data and annotation. Thus, the application and success of deep learning systems are constrained by two main factors:
1. High-quality annotation of diverse data is expensive to obtain, both in terms of time and labour.
2. Human annotations are inherently biased, leading to over-fitting and the learning of sub-optimal features.
It would therefore be desirable to 1. extract more useful information from existing valuable annotations; and 2. achieve similar performance with weaker supervision, such as fewer annotated samples or indirect supervision from implicit annotations that are cheap to obtain.
To this end, our line of work explores a range of computer vision tasks, including action recognition, action localisation, disentangled representation learning and novel image generation, across different levels of supervision distinguished by data scarcity. Our journey begins with fully supervised action localisation, with the aim of making better use of precious annotations. We then investigate one-shot action recognition and localisation, which aims to learn robust visual descriptors from very few annotated samples. Finally, we move from discriminative learning to generative learning, exploring unsupervised feature disentanglement and image generation. In particular, our work can be divided into four parts:
We design and construct a large-scale RGB-D video dataset of gym activities to facilitate fully supervised video understanding. In contrast to existing public datasets, all actions in our dataset have clearly defined boundaries, which results in less annotation noise and makes the dataset better suited to video representation learning.
We then propose an instance-aware video labelling framework that explores the synergy between instance-level and frame-level information. Here we aim to take full advantage of precious annotations and to combine multi-level information to mitigate the inherent noise in temporal action localisation. Our fusion method is shown to capture complementary information and achieves better localisation performance, in the spirit of the sketch below.
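To make the fusion idea concrete, the following is a minimal sketch of one way frame-level and instance-level evidence could be combined for temporal localisation. It assumes per-frame class probabilities from a frame-level stream and class probabilities aggregated from per-frame instance detections are already available; all function and parameter names are illustrative, not the thesis's actual implementation.

```python
import numpy as np

def fuse_scores(frame_probs, instance_probs, alpha=0.5):
    """Late fusion of two score streams (illustrative).

    frame_probs:    (T, C) frame-level class probabilities.
    instance_probs: (T, C) class probabilities aggregated from per-frame
                    instance (e.g. person) detections.
    alpha:          weight on the frame-level stream.
    """
    assert frame_probs.shape == instance_probs.shape
    # Convex combination: frames where both streams agree are reinforced,
    # while disagreements (likely noise in one stream) are damped.
    return alpha * frame_probs + (1.0 - alpha) * instance_probs

def localise(fused_probs, cls, threshold=0.5):
    """Turn fused per-frame scores into (start, end) segments for one class."""
    active = fused_probs[:, cls] > threshold
    segments, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments
```

This simple thresholding of fused scores stands in for the actual localisation head; the point is only that the two streams carry complementary information that a combination can exploit.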
Next, we propose a one-/few-shot action localisation method that leverages meta-learning, aiming to make better use of very rare and limited data. We design a similarity-based action localisation network under the meta-learning framework that learns transferable, class-agnostic video descriptors, together with a structured representation for time series that takes into account the natural evolution of actions. As a result, we achieve state-of-the-art performance in both one-/few-shot action recognition and localisation.
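The similarity-based classification step of one meta-learning episode can be sketched as below, in the style of prototypical networks. This is a generic illustration under assumed names (`embed` is a placeholder for any clip-to-descriptor network), not the architecture described above.

```python
import torch
import torch.nn.functional as F

def episode_logits(embed, support, support_labels, query, n_way):
    """One few-shot episode: classify queries by similarity to prototypes.

    embed:          network mapping a video clip to a fixed-size descriptor.
    support:        (n_way * k_shot, ...) labelled clips.
    support_labels: (n_way * k_shot,) integer labels in [0, n_way).
    query:          (n_query, ...) unlabelled clips.
    """
    s = F.normalize(embed(support), dim=-1)   # (N*K, D) support descriptors
    q = F.normalize(embed(query), dim=-1)     # (Q, D) query descriptors
    # Class prototype = mean descriptor of that class's support clips.
    protos = torch.stack(
        [s[support_labels == c].mean(dim=0) for c in range(n_way)]
    )                                         # (n_way, D)
    # Cosine similarity of each query to each prototype serves as logits;
    # the descriptor space itself is class-agnostic and transferable.
    return q @ protos.t()                     # (Q, n_way)
```

Training over many such episodes, each with a fresh set of classes, is what forces the learned descriptors to transfer to unseen actions at test time.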
Further stripping down the strength of supervision, we propose an unsupervised disentangled representation learning framework that can generate novel images with desired attributes. Through this unsupervised exploration, our network learns structured, meaningful features from only implicit cycle-consistency and GAN-based constraints. The reported results are comparable to, or even better than, those of fully supervised counterparts.
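The flavour of such implicit constraints can be illustrated with the following sketch of a single training step, where attribute codes are swapped between two images and only adversarial and cycle terms supervise the result. Every module and weight here (`enc_attr`, `enc_rest`, `dec`, `disc`, the loss weight) is a hypothetical placeholder, not the framework's actual design.

```python
import torch

def disentangle_step(enc_attr, enc_rest, dec, disc, x_a, x_b):
    """One illustrative step: swap attribute codes between two images and
    train with only GAN and cycle constraints (no attribute labels)."""
    attr_a, rest_a = enc_attr(x_a), enc_rest(x_a)
    attr_b, rest_b = enc_attr(x_b), enc_rest(x_b)

    # Novel images: each keeps its own content but borrows the other's attribute.
    x_ab = dec(attr_b, rest_a)
    x_ba = dec(attr_a, rest_b)

    # GAN constraint (generator-side, WGAN-style): swapped images must
    # still look realistic to the discriminator.
    adv = -(disc(x_ab).mean() + disc(x_ba).mean())

    # Cycle constraint: swapping the attributes back must recover the inputs,
    # which is what forces attribute and content codes apart.
    x_a_rec = dec(enc_attr(x_ba), enc_rest(x_ab))
    x_b_rec = dec(enc_attr(x_ab), enc_rest(x_ba))
    cyc = (x_a - x_a_rec).abs().mean() + (x_b - x_b_rec).abs().mean()

    return adv + 10.0 * cyc  # cycle weight is an illustrative choice
```

Note that neither term ever names an attribute explicitly; the supervision is entirely implicit, which is what makes the setting unsupervised.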