Audio-Visual Segmentation with Semantics
| dc.contributor.author | Zhou, Jinxing | en |
| dc.contributor.author | Shen, Xuyang | en |
| dc.contributor.author | Wang, Jianyuan | en |
| dc.contributor.author | Zhang, Jiayi | en |
| dc.contributor.author | Sun, Weixuan | en |
| dc.contributor.author | Zhang, Jing | en |
| dc.contributor.author | Birchfield, Stan | en |
| dc.contributor.author | Guo, Dan | en |
| dc.contributor.author | Kong, Lingpeng | en |
| dc.contributor.author | Wang, Meng | en |
| dc.contributor.author | Zhong, Yiran | en |
| dc.date.accessioned | 2025-05-23T15:20:54Z | |
| dc.date.available | 2025-05-23T15:20:54Z | |
| dc.date.issued | 2024 | en |
| dc.description.abstract | We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects, indicating the pixels corresponding to the audio, while the third setting further requires generating semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench. | en |
| dc.description.sponsorship | This work was supported by the National Key R&D Program of China (NO.2022YFB4500601), the National Natural Science Foundation of China (72188101, 62272144, 62020106007, and U20A20183), the Major Project of Anhui Province (202203a05020011), and the Fundamental Research Funds for the Central Universities (JZ2024HGTG0309, JZ2024AHST0337, JZ2023YQTD0072). | en |
| dc.description.status | Peer-reviewed | en |
| dc.identifier.issn | 0920-5691 | en |
| dc.identifier.scopus | 85206798470 | en |
| dc.identifier.uri | http://www.scopus.com/inward/record.url?scp=85206798470&partnerID=8YFLogxK | en |
| dc.identifier.uri | https://hdl.handle.net/1885/733752517 | |
| dc.language.iso | en | en |
| dc.rights | Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024. | en |
| dc.source | International Journal of Computer Vision | en |
| dc.subject | Audio-visual learning | en |
| dc.subject | Audio-visual segmentation | en |
| dc.subject | AVSBench | en |
| dc.subject | Multi-modal segmentation | en |
| dc.subject | Semantic segmentation | en |
| dc.subject | Video segmentation | en |
| dc.title | Audio-Visual Segmentation with Semantics | en |
| dc.type | Journal article | en |
| dspace.entity.type | Publication | en |
| local.contributor.affiliation | Zhou, Jinxing; Hefei University of Technology | en |
| local.contributor.affiliation | Shen, Xuyang; Shanghai AI Lab | en |
| local.contributor.affiliation | Wang, Jianyuan; University of Oxford | en |
| local.contributor.affiliation | Zhang, Jiayi; Beihang University | en |
| local.contributor.affiliation | Sun, Weixuan; Shanghai AI Lab | en |
| local.contributor.affiliation | Zhang, Jing; School of Computing, ANU College of Systems and Society, The Australian National University | en |
| local.contributor.affiliation | Birchfield, Stan; NVIDIA | en |
| local.contributor.affiliation | Guo, Dan; Hefei University of Technology | en |
| local.contributor.affiliation | Kong, Lingpeng; The University of Hong Kong | en |
| local.contributor.affiliation | Wang, Meng; Hefei University of Technology | en |
| local.contributor.affiliation | Zhong, Yiran; Shanghai AI Lab | en |
| local.identifier.doi | 10.1007/s11263-024-02261-x | en |
| local.identifier.pure | bb065b7d-f6ac-43e2-9c6d-9042f34e80a9 | en |
| local.identifier.url | https://www.scopus.com/pages/publications/85206798470 | en |
| local.type.status | Accepted/In press | en |