SATO: Stable Text-to-Motion Framework.

Chen, Wenshuo; Xiao, Hongru; Zhang, Erhang; Hu, Lijie; Wang, Lei; Liu, Mengyuan; Chen, Chen

dc.contributor.author	Chen, Wenshuo	en
dc.contributor.author	Xiao, Hongru	en
dc.contributor.author	Zhang, Erhang	en
dc.contributor.author	Hu, Lijie	en
dc.contributor.author	Wang, Lei	en
dc.contributor.author	Liu, Mengyuan	en
dc.contributor.author	Chen, Chen	en
dc.date.accessioned	2025-05-31T02:27:52Z
dc.date.available	2025-05-31T02:27:52Z
dc.date.issued	2024	en
dc.description.abstract	Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness trade-off. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduced a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance. Codes and models are released at	en
dc.description.status	Peer-reviewed	en
dc.format.extent	9	en
dc.identifier.other	dblp:conf/mm/ChenXZH0L024	en
dc.identifier.other	ORCID:/0000-0002-8600-7099/work/171405397	en
dc.identifier.scopus	85206697200	en
dc.identifier.uri	https://hdl.handle.net/1885/733755807
dc.language.iso	en	en
dc.relation.ispartof	ACM Multimedia	en
dc.rights	DBLP License: DBLP's bibliographic metadata records provided through http://dblp.org/ are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.	en
dc.title	SATO: Stable Text-to-Motion Framework.	en
dc.type	Conference paper	en
dspace.entity.type	Publication	en
local.bibliographicCitation.lastpage	6997	en
local.bibliographicCitation.startpage	6989	en
local.contributor.affiliation	Chen, Wenshuo; Shandong University	en
local.contributor.affiliation	Xiao, Hongru; Tongji University	en
local.contributor.affiliation	Zhang, Erhang; Shandong University	en
local.contributor.affiliation	Hu, Lijie; King Abdullah University of Science and Technology	en
local.contributor.affiliation	Wang, Lei; School of Computing, ANU College of Systems and Society, The Australian National University	en
local.contributor.affiliation	Liu, Mengyuan; Peking University	en
local.contributor.affiliation	Chen, Chen; University of Central Florida	en
local.identifier.doi	10.1145/3664647.3681034	en
local.identifier.pure	bec8b9b3-76c4-4db5-8d5e-c04986a76321	en
local.type.status	Published	en

Collections

ANU Research Publications