SATO: Stable Text-to-Motion Framework.

dc.contributor.author: Chen, Wenshuo [en]
dc.contributor.author: Xiao, Hongru [en]
dc.contributor.author: Zhang, Erhang [en]
dc.contributor.author: Hu, Lijie [en]
dc.contributor.author: Wang, Lei [en]
dc.contributor.author: Liu, Mengyuan [en]
dc.contributor.author: Chen, Chen [en]
dc.date.accessioned: 2025-05-31T02:27:52Z
dc.date.available: 2025-05-31T02:27:52Z
dc.date.issued: 2024 [en]
dc.description.abstract: Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness trade-off. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduced a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance. Codes and models are released at [en]
dc.description.status: Peer-reviewed [en]
dc.format.extent: 9 [en]
dc.identifier.other: dblp:conf/mm/ChenXZH0L024 [en]
dc.identifier.other: ORCID:/0000-0002-8600-7099/work/171405397 [en]
dc.identifier.scopus: 85206697200 [en]
dc.identifier.uri: https://hdl.handle.net/1885/733755807
dc.language.iso: en [en]
dc.relation.ispartof: ACM Multimedia [en]
dc.rights: DBLP License: DBLP's bibliographic metadata records provided through http://dblp.org/ are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions. [en]
dc.title: SATO: Stable Text-to-Motion Framework. [en]
dc.type: Conference paper [en]
dspace.entity.type: Publication [en]
local.bibliographicCitation.lastpage: 6997 [en]
local.bibliographicCitation.startpage: 6989 [en]
local.contributor.affiliation: Chen, Wenshuo; Shandong University [en]
local.contributor.affiliation: Xiao, Hongru; Tongji University [en]
local.contributor.affiliation: Zhang, Erhang; Shandong University [en]
local.contributor.affiliation: Hu, Lijie; King Abdullah University of Science and Technology [en]
local.contributor.affiliation: Wang, Lei; School of Computing, ANU College of Systems and Society, The Australian National University [en]
local.contributor.affiliation: Liu, Mengyuan; Peking University [en]
local.contributor.affiliation: Chen, Chen; University of Central Florida [en]
local.identifier.doi: 10.1145/3664647.3681034 [en]
local.identifier.pure: bec8b9b3-76c4-4db5-8d5e-c04986a76321 [en]
local.identifier.url: https://www.scopus.com/pages/publications/85206697200 [en]
local.type.status: Published [en]
