Training A Small Emotional Vision Language Model for Visual Art Comprehension

Zhang, Jing; Zheng, Liang; Wang, Meng; Guo, Dan

Date

2025

Authors

Zhang, Jing

Zheng, Liang

Wang, Meng

Guo, Dan

Publisher

Springer Science+Business Media B.V.

Abstract

This paper develops small vision language models for understanding visual art: given an artwork, the model identifies its emotion category and explains this prediction in natural language. While small models are computationally efficient, their capacity is far more limited than that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived from a VAD dictionary, and add a VAD head that aligns the VAD vectors of the predicted emotion explanation and the ground truth. This allows the vision language model to understand and generate emotional text better than it can with traditional text embeddings alone. On the other hand, we design a contrastive head that pulls together the embeddings of the image, its emotion class, and its explanation, which aligns model outputs with inputs. On two public affective explanation datasets, we show that the proposed techniques consistently improve the visual art understanding performance of baseline SEVLMs. Importantly, the proposed model can be trained and evaluated on a single RTX 2080 Ti while exhibiting very strong performance: it not only outperforms the state-of-the-art small models but is also competitive with LLaVA 7B after fine-tuning and with GPT4(V). The code is available at https://github.com/BetterZH/SEVLM-code.
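
The two alignment ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes a toy VAD lexicon with made-up values, a mean-pooled VAD vector per text, a mean-squared-error alignment term, and an InfoNCE-style contrastive term; the names `TOY_VAD`, `text_vad`, `vad_alignment_loss`, and `info_nce` are hypothetical and chosen only for illustration.

```python
import math

# Toy VAD lexicon: word -> (valence, arousal, dominance) in [0, 1].
# Values are illustrative placeholders, not taken from any published lexicon.
TOY_VAD = {
    "calm": (0.8, 0.2, 0.6),
    "storm": (0.3, 0.9, 0.4),
    "joyful": (0.9, 0.7, 0.7),
    "lonely": (0.2, 0.3, 0.2),
}
NEUTRAL = (0.5, 0.5, 0.5)  # assumed fallback for out-of-lexicon words


def text_vad(text):
    """Mean VAD vector over the words of a text (assumed pooling choice)."""
    words = text.lower().split()
    if not words:
        return NEUTRAL
    vecs = [TOY_VAD.get(w, NEUTRAL) for w in words]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(3))


def vad_alignment_loss(predicted, reference):
    """Mean squared error between the VAD vectors of two explanations."""
    p, r = text_vad(predicted), text_vad(reference)
    return sum((a - b) ** 2 for a, b in zip(p, r)) / 3


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closest to its positive."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[positive_index]
```

In this sketch, an explanation whose words carry VAD values close to those of the reference yields a small alignment loss, and an image embedding that lies nearest to the embedding of its own emotion class or explanation yields a small contrastive loss. How the paper actually fuses VAD features and defines its heads may differ.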

Keywords

Emotion understanding, Small vision language models, Valence-Arousal-Dominance (VAD) emotion modeling

URI

http://www.scopus.com/inward/record.url?scp=85209596290&partnerID=8YFLogxK
https://hdl.handle.net/1885/733752356

Collections

ANU Research Publications

Type

Conference paper

Book Title

Computer Vision – ECCV 2024 - 18th European Conference, Proceedings

Entity type

Publication

DOI

10.1007/978-3-031-72855-6_23
