See more, know more: Unsupervised video object segmentation with co-attention siamese networks

dc.contributor.author: Lu, Xiankai
dc.contributor.author: Wang, Wenguan
dc.contributor.author: Ma, Chao
dc.contributor.author: Shen, Jianbing
dc.contributor.author: Shao, Ling
dc.contributor.author: Porikli, Fatih
dc.date.accessioned: 2025-05-23T23:23:36Z
dc.date.available: 2025-05-23T23:23:36Z
dc.date.issued: 2019
dc.description.abstract: We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of the inherent correlation among video frames and incorporate a global co-attention mechanism to further improve state-of-the-art deep-learning-based solutions, which primarily focus on learning discriminative foreground representations over appearance and motion within short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments the training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing and salient foreground objects. We propose a unified and end-to-end trainable framework in which different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments on three large benchmarks show that COSNet outperforms the current alternatives by a large margin. We will publicly release our implementation and models.
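The abstract describes co-attention layers that correlate the features of two video frames and append the attention responses to a joint feature space. A rough NumPy sketch of a vanilla (fully-connected) co-attention step between two flattened frame feature maps is given below; the function names, shapes, and the concatenation choice are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(Fa, Fb, W):
    """Illustrative co-attention between two frames (not the paper's code).

    Fa, Fb: (C, N) feature maps of frames A and B, flattened over space.
    W:      (C, C) learnable weight of the affinity computation.
    Returns co-attention-augmented features of shape (2C, N) per frame.
    """
    # Affinity between every position pair of the two frames.
    S = Fa.T @ W @ Fb                      # (N, N)
    # Summarize frame B for each position of A, and vice versa.
    Za = Fb @ softmax(S, axis=1).T         # (C, N)
    Zb = Fa @ softmax(S, axis=0)           # (C, N)
    # Append the co-attention responses to the original features.
    return np.concatenate([Za, Fa], 0), np.concatenate([Zb, Fb], 0)
```

Attending in both directions lets each frame's representation absorb global context from the other, which is the correlation the abstract says short-term motion-based methods miss.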
dc.description.sponsorship: By regarding UVOS as a temporal-coherence-capturing task, we proposed a novel model, COSNet, to estimate the primary target(s). Through an alternated network training strategy with saliency image and video pairs, the proposed network learns to discriminate primary objects from the background in each frame and to capture the temporal correlation across frames. The proposed method achieved superior performance on three representative video segmentation datasets. Extensive experimental results showed that our method can effectively suppress distraction from similar targets even though no annotation is given during segmentation. COSNet is a general framework for sequential data learning and can be readily extended to other video analysis tasks, such as video saliency detection and optical flow estimation. Acknowledgements: This work was supported in part by the National Key Research and Development Program of China (2016YFB1001003), STCSM (18DZ1112300), and the Australian Research Council's Discovery Projects funding scheme (DP150104645).
dc.description.status: Peer-reviewed
dc.format.extent: 10
dc.identifier.isbn: 9781728132938
dc.identifier.issn: 1063-6919
dc.identifier.scopus: 85070533762
dc.identifier.uri: http://www.scopus.com/inward/record.url?scp=85070533762&partnerID=8YFLogxK
dc.identifier.uri: https://hdl.handle.net/1885/733753192
dc.language.iso: en
dc.publisher: IEEE Computer Society
dc.relation.ispartof: Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
dc.relation.ispartofseries: 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
dc.relation.ispartofseries: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
dc.rights: Publisher Copyright: © 2019 IEEE.
dc.subject: Motion and Tracking
dc.subject: Video Analytics
dc.subject: Vision Applications and Systems
dc.title: See more, know more: Unsupervised video object segmentation with co-attention siamese networks
dc.type: Conference paper
dspace.entity.type: Publication
local.bibliographicCitation.lastpage: 3627
local.bibliographicCitation.startpage: 3618
local.contributor.affiliation: Lu, Xiankai; Inception Institute of Artificial Intelligence
local.contributor.affiliation: Wang, Wenguan; Inception Institute of Artificial Intelligence
local.contributor.affiliation: Ma, Chao; Shanghai Jiao Tong University
local.contributor.affiliation: Shen, Jianbing; Inception Institute of Artificial Intelligence
local.contributor.affiliation: Shao, Ling; Inception Institute of Artificial Intelligence
local.contributor.affiliation: Porikli, Fatih; School of Engineering, ANU College of Systems and Society, The Australian National University
local.identifier.ariespublication: a383154xPUB11770
local.identifier.doi: 10.1109/CVPR.2019.00374
local.identifier.pure: 76556323-74d7-4a73-9c9e-9d93b37bdba4
local.identifier.url: https://www.scopus.com/pages/publications/85070533762
local.type.status: Published
