SATO: Stable Text-to-Motion Framework
Authors
Chen, Wenshuo
Xiao, Hongru
Zhang, Erhang
Hu, Lijie
Wang, Lei
Liu, Mengyuan
Chen, Chen
Abstract
Is the text-to-motion model robust? Recent advancements in text-to-motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework to address this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and maintaining a balance in the accuracy-robustness trade-off. We present a methodology for constructing a SATO that satisfies the stability of both attention and prediction. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while maintaining high accuracy. Codes and models are released at
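
The abstract names three modules: stable attention, stable prediction, and an accuracy-robustness balance. As a rough illustration only, the Python sketch below renders the first two as consistency losses between a prompt and a synonym-perturbed variant; the model interface, its (motion, attention) return signature, the KL/MSE choices, and the weights lambda_attn/lambda_pred are assumptions made for illustration, not the paper's implementation.

import torch
import torch.nn.functional as F

def stability_losses(model, text, perturbed_text,
                     lambda_attn=1.0, lambda_pred=1.0):
    """Hypothetical sketch of the two stability terms the abstract
    describes, computed between a prompt and a synonym-perturbed
    variant. `model` is assumed to return a generated motion sequence
    and the text encoder's (normalized) attention weights."""
    motion, attn = model(text)                # pose sequence + attention map
    motion_p, attn_p = model(perturbed_text)  # same, for the perturbed prompt

    # Stable attention: keep the text encoder's attention distribution
    # close under perturbation (KL divergence between attention maps).
    attn_loss = F.kl_div(attn_p.log(), attn, reduction="batchmean")

    # Stable prediction: keep the generated motions close under perturbation.
    pred_loss = F.mse_loss(motion_p, motion)

    # The third module balances accuracy against robustness; a weighted
    # sum stands in here for whatever scheme the paper actually uses.
    return lambda_attn * attn_loss + lambda_pred * pred_loss

In this reading, a model is "stable" when both terms are small for every synonym pair in the perturbation dataset while the task loss on the original prompt stays low.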
Book Title
ACM Multimedia
Entity type
Publication