See more, know more: Unsupervised video object segmentation with co-attention siamese networks

Lu, Xiankai; Wang, Wenguan; Ma, Chao; Shen, Jianbing; Shao, Ling; Porikli, Fatih

dc.contributor.author	Lu, Xiankai	en
dc.contributor.author	Wang, Wenguan	en
dc.contributor.author	Ma, Chao	en
dc.contributor.author	Shen, Jianbing	en
dc.contributor.author	Shao, Ling	en
dc.contributor.author	Porikli, Fatih	en
dc.date.accessioned	2025-05-23T23:23:36Z
dc.date.available	2025-05-23T23:23:36Z
dc.date.issued	2019	en
dc.description.abstract	We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to further improve the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing and salient foreground objects. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks demonstrate that COSNet outperforms the current alternatives by a large margin. We will publicly release our implementation and models.	en
dc.description.sponsorship	By regarding UVOS as a temporal coherence capturing task, we proposed a novel model, COSNet, to estimate the primary target(s). Through an alternated network training strategy with saliency image and video pairs, the proposed network learns to discriminate primary objects from the background in each frame and capture the temporal correlation across frames. The proposed method achieved superior performance on three representative video segmentation datasets. Extensive experimental results proved that our method can effectively suppress similar target distraction despite no annotation being given during the segmentation. The COSNet is a general framework for handling sequential data learning, and can be readily extended to other video analysis tasks, such as video saliency detection and optical flow estimation. Acknowledgements: This work was supported in part by the National Key Research and Development Program of China (2016YFB1001003), STCSM (18DZ1112300), and the Australian Research Council’s Discovery Projects funding scheme (DP150104645).	en
dc.description.status	Peer-reviewed	en
dc.format.extent	10	en
dc.identifier.isbn	9781728132938	en
dc.identifier.issn	1063-6919	en
dc.identifier.scopus	85070533762	en
dc.identifier.uri	http://www.scopus.com/inward/record.url?scp=85070533762&partnerID=8YFLogxK	en
dc.identifier.uri	https://hdl.handle.net/1885/733753192
dc.language.iso	en	en
dc.publisher	IEEE Computer Society	en
dc.relation.ispartof	Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019	en
dc.relation.ispartofseries	32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019	en
dc.relation.ispartofseries	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition	en
dc.rights	Publisher Copyright: © 2019 IEEE.	en
dc.subject	Motion and Tracking	en
dc.subject	Video Analytics	en
dc.subject	Vision Applications and Systems	en
dc.title	See more, know more: Unsupervised video object segmentation with co-attention siamese networks	en
dc.type	Conference paper	en
dspace.entity.type	Publication	en
local.bibliographicCitation.lastpage	3627	en
local.bibliographicCitation.startpage	3618	en
local.contributor.affiliation	Lu, Xiankai; Inception Institute of Artificial Intelligence	en
local.contributor.affiliation	Wang, Wenguan; Inception Institute of Artificial Intelligence	en
local.contributor.affiliation	Ma, Chao; Shanghai Jiao Tong University	en
local.contributor.affiliation	Shen, Jianbing; Inception Institute of Artificial Intelligence	en
local.contributor.affiliation	Shao, Ling; Inception Institute of Artificial Intelligence	en
local.contributor.affiliation	Porikli, Fatih; School of Engineering, ANU College of Systems and Society, The Australian National University	en
local.identifier.ariespublication	a383154xPUB11770	en
local.identifier.doi	10.1109/CVPR.2019.00374	en
local.identifier.pure	76556323-74d7-4a73-9c9e-9d93b37bdba4	en
local.identifier.url	https://www.scopus.com/pages/publications/85070533762	en
local.type.status	Published	en

Collections

ANU Research Publications
