Audio-Visual Segmentation with Semantics

dc.contributor.author: Zhou, Jinxing [en]
dc.contributor.author: Shen, Xuyang [en]
dc.contributor.author: Wang, Jianyuan [en]
dc.contributor.author: Zhang, Jiayi [en]
dc.contributor.author: Sun, Weixuan [en]
dc.contributor.author: Zhang, Jing [en]
dc.contributor.author: Birchfield, Stan [en]
dc.contributor.author: Guo, Dan [en]
dc.contributor.author: Kong, Lingpeng [en]
dc.contributor.author: Wang, Meng [en]
dc.contributor.author: Zhong, Yiran [en]
dc.date.accessioned: 2025-05-23T15:20:54Z
dc.date.available: 2025-05-23T15:20:54Z
dc.date.issued: 2024 [en]
dc.description.abstract: We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, AVSBench, which provides pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects that indicate the pixels corresponding to the audio, while the third setting further requires generating semantic maps that indicate the object category. To address these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench. [en]
dc.description.sponsorship: This work was supported by the National Key R&D Program of China (No. 2022YFB4500601), the National Natural Science Foundation of China (72188101, 62272144, 62020106007, and U20A20183), the Major Project of Anhui Province (202203a05020011), and the Fundamental Research Funds for the Central Universities (JZ2024HGTG0309, JZ2024AHST0337, JZ2023YQTD0072). [en]
dc.description.status: Peer-reviewed [en]
dc.identifier.issn: 0920-5691 [en]
dc.identifier.scopus: 85206798470 [en]
dc.identifier.uri: http://www.scopus.com/inward/record.url?scp=85206798470&partnerID=8YFLogxK [en]
dc.identifier.uri: https://hdl.handle.net/1885/733752517
dc.language.iso: en [en]
dc.rights: Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024. [en]
dc.source: International Journal of Computer Vision [en]
dc.subject: Audio-visual learning [en]
dc.subject: Audio-visual segmentation [en]
dc.subject: AVSBench [en]
dc.subject: Multi-modal segmentation [en]
dc.subject: Semantic segmentation [en]
dc.subject: Video segmentation [en]
dc.title: Audio-Visual Segmentation with Semantics [en]
dc.type: Journal article [en]
dspace.entity.type: Publication [en]
local.contributor.affiliation: Zhou, Jinxing; Hefei University of Technology [en]
local.contributor.affiliation: Shen, Xuyang; Shanghai AI Lab [en]
local.contributor.affiliation: Wang, Jianyuan; University of Oxford [en]
local.contributor.affiliation: Zhang, Jiayi; Beihang University [en]
local.contributor.affiliation: Sun, Weixuan; Shanghai AI Lab [en]
local.contributor.affiliation: Zhang, Jing; School of Computing, ANU College of Systems and Society, The Australian National University [en]
local.contributor.affiliation: Birchfield, Stan; NVIDIA [en]
local.contributor.affiliation: Guo, Dan; Hefei University of Technology [en]
local.contributor.affiliation: Kong, Lingpeng; The University of Hong Kong [en]
local.contributor.affiliation: Wang, Meng; Hefei University of Technology [en]
local.contributor.affiliation: Zhong, Yiran; Shanghai AI Lab [en]
local.identifier.doi: 10.1007/s11263-024-02261-x [en]
local.identifier.pure: bb065b7d-f6ac-43e2-9c6d-9042f34e80a9 [en]
local.identifier.url: https://www.scopus.com/pages/publications/85206798470 [en]
local.type.status: Accepted/In press [en]
