Discriminatively Learned Hierarchical Rank Pooling Networks
Rank pooling is a temporal encoding method that summarizes the dynamics of a video sequence into a single vector and has shown good results in human action recognition in prior work. In this work, we present novel temporal encoding methods for action and activity classification by extending the unsupervised rank pooling temporal encoding method in two ways. First, we present discriminative rank pooling, in which the shared weights of our video representation and the parameters of the action classifiers are estimated jointly for a given training dataset of labelled vector sequences using a bilevel optimization formulation of the learning problem. When the frame-level feature vectors are obtained from a convolutional neural network (CNN), we rank pool the network activations and jointly estimate all parameters of the model, including CNN filters and fully connected weights, in an end-to-end manner, which we coin the end-to-end trainable rank-pooled CNN. Importantly, this model can make use of any existing convolutional neural network architecture (e.g., AlexNet or VGG) without modification or the introduction of additional parameters. Second, we extend rank pooling to a high-capacity video representation, called hierarchical rank pooling. Hierarchical rank pooling consists of a network of rank pooling functions that encode temporal semantics over arbitrarily long video clips based on rich frame-level features. By stacking non-linear feature functions and temporal sub-sequence encoders one on top of the other, we build a high-capacity encoding network of the dynamic behaviour of the video. The resulting video representation is a fixed-length feature vector describing the entire video clip, which can be used as input to standard machine learning classifiers. We demonstrate our approach on the task of action and activity recognition.
We present a detailed analysis of our approach against competing methods and explore variants such as hierarchy depth and the choice of non-linear feature function. The obtained results are comparable to state-of-the-art methods on three important activity recognition benchmarks, with classification performance of 76.7% mAP on Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101.
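The core rank pooling operation described above can be illustrated with a short sketch. This is a minimal, hedged approximation, not the paper's implementation: it uses the common least-squares relaxation of the ranking objective (fit weights so that their inner product with the time-varying mean of the frame features increases with time, and use those weights as the video descriptor). The function names `rank_pool` and `hierarchical_rank_pool`, the signed square-root non-linearity, and the `window`/`stride` parameters are illustrative assumptions.

```python
import numpy as np

def rank_pool(frames):
    """Encode a sequence of frame features (T, d) into a single d-vector.

    Least-squares relaxation of rank pooling (an assumption for this
    sketch): find w such that w . v_t ~ t, where v_t is the mean of the
    first t frame features. The solution w is the video descriptor.
    """
    T = frames.shape[0]
    # Time-varying mean smooths the raw frame features before ranking.
    V = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    # Solve min_w sum_t (w . v_t - t)^2 in closed form.
    w, *_ = np.linalg.lstsq(V, t, rcond=None)
    return w

def hierarchical_rank_pool(frames, window=4, stride=2):
    """Two-level hierarchy: rank pool overlapping sub-sequences, apply a
    non-linear feature map, then rank pool the resulting sequence."""
    T = frames.shape[0]
    mids = []
    for s in range(0, max(T - window, 0) + 1, stride):
        u = rank_pool(frames[s:s + window])
        # Signed square root as an example point-wise non-linearity.
        mids.append(np.sign(u) * np.sqrt(np.abs(u)))
    return rank_pool(np.stack(mids))

# Example: a 16-frame clip with 5-dimensional frame features yields a
# fixed-length 5-dimensional clip descriptor at either level.
X = np.random.RandomState(0).randn(16, 5)
print(rank_pool(X).shape, hierarchical_rank_pool(X).shape)
```

Deeper hierarchies follow the same pattern: each level maps a sequence of vectors to a shorter sequence of encoded vectors, and the final level produces the fixed-length clip descriptor fed to a standard classifier.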
This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016).
© 2017 Springer Science+Business Media, LLC
International Journal of Computer Vision
Convolutional neural networks
091599 - Interdisciplinary Engineering not elsewhere classified
Fernando, Basura, College of Engineering and Computer Science, ANU
Gould, Stephen, College of Engineering and Computer Science, ANU
https://v2.sherpa.ac.uk/id/publication/13404...: "Author Accepted Manuscript can be made open access on institutional repository after 12 month embargo" (SHERPA/RoMEO, as at 4.11.2021).
ANU Research Publications
Author Accepted Manuscript