EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

HKUST(GZ), Fudan University, HKUST
AAAI 2026 Oral
EmoVid overview

EmoVid provides a multimodal, emotion-centric video benchmark for both understanding and generation tasks.

Abstract

We introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for artistic media, covering cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perception across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The fine-tuned model shows significant improvements in both quantitative metrics and the visual quality of generated videos on text-to-video and image-to-video tasks.

Generation and Dataset Visualizations

Curated multimodal samples from EmoVid, combining representative dataset snippets and controllable custom-role generations.

📊 Dataset Sample

Snippets from all three video sources (cartoon animations, movie clips, animated stickers) for every emotion class.

🎭 Custom Roles

Role-conditioned generations showcasing controllable styles.

Key Findings From EmoVid

Color and brightness attributes per emotion

Color and Brightness Are Strong Emotional Indicators

Our analysis of the EmoVid dataset reveals a clear and quantifiable link between low-level visual attributes and high-level emotional categories. Videos conveying positive-valence emotions (such as contentment and amusement) are generally brighter and more colorful, while videos conveying high-arousal emotions (such as anger and excitement) also trend toward higher colorfulness but with lower brightness. This separation in color space supports the robustness of the valence-arousal model.
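As a rough illustration of how such attributes can be measured (the exact formulas used for EmoVid are not restated on this page), the sketch below computes per-video brightness as the mean HSV value channel and colorfulness with the common Hasler-Süsstrunk metric. Both choices, and the frame-sampling step, are assumptions for illustration only.

```python
import cv2
import numpy as np

def colorfulness(frame_bgr):
    """Hasler-Susstrunk colorfulness metric on a single BGR frame (assumed metric)."""
    b, g, r = cv2.split(frame_bgr.astype(np.float32))
    rg = r - g                      # red-green opponent channel
    yb = 0.5 * (r + g) - b          # yellow-blue opponent channel
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return std_root + 0.3 * mean_root

def video_attributes(path, step=10):
    """Average brightness (HSV value channel) and colorfulness over sampled frames."""
    cap = cv2.VideoCapture(path)
    brightness, colorful = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                          # sample every `step`-th frame
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            brightness.append(hsv[..., 2].mean())    # V channel in [0, 255]
            colorful.append(colorfulness(frame))
        idx += 1
    cap.release()
    return float(np.mean(brightness)), float(np.mean(colorful))
```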

User study results

Fine-Tuning on EmoVid Decisively Wins User Preference

To validate the practical value of our dataset, we fine-tuned the Wan2.1 generative model on EmoVid. In a controlled perceptual user study, participants consistently preferred the fine-tuned model over strong baselines (Wan-Original and CogVideoX): it was ranked first in 66.2% of comparisons for emotion expression and achieved a 57.9% Top-1 preference rate for aesthetic quality. This demonstrates EmoVid's utility for enhancing the emotional expressiveness of generative models.

Emotion transition matrix

Emotions Show Strong Temporal Persistence

A key advantage of a video dataset is the ability to study temporal dynamics. By analyzing consecutive movie clips, we built an emotion transition matrix. The analysis shows that emotions have strong self-persistence (reflected in the dominant diagonal), and that transitions occur far more often within the same valence (e.g., from one negative emotion to another) than across valences (e.g., from positive to negative).
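The sketch below shows one straightforward way to build such a row-normalized transition matrix from label sequences of consecutive clips. The emotion label set and the aggregation over movies are assumptions made only to keep the construction concrete; they are not taken from the paper.

```python
import numpy as np

# Assumed label set for illustration; the actual EmoVid categories may differ.
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]
IDX = {e: i for i, e in enumerate(EMOTIONS)}

def transition_matrix(label_sequences):
    """Row-normalized transition probabilities from per-movie label sequences.

    label_sequences: list of lists, each holding the emotion labels of
    consecutive clips from the same movie, in temporal order.
    """
    counts = np.zeros((len(EMOTIONS), len(EMOTIONS)), dtype=np.float64)
    for seq in label_sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[IDX[prev], IDX[curr]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Divide each row by its total; rows with no transitions stay zero.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)
```

A strong diagonal in the returned matrix corresponds to the self-persistence described above, and block structure by valence corresponds to within-valence transitions dominating cross-valence ones.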

Top-5 phrases per emotion

Text Captions and Emotion Labels Are Strongly Aligned

EmoVid is a multimodal dataset, and our analysis confirms its cross-modal consistency. We performed sentiment polarity analysis on the text captions for each video. The results show a clear pattern: videos labeled with positive emotions (e.g., amusement, excitement, contentment) have a significantly higher proportion of positive-sentiment captions, while videos labeled with negative emotions show the opposite pattern. This confirms the strong semantic link between the textual content and the perceived emotion.
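As an illustration, the following sketch computes the proportion of positive-sentiment captions per emotion label using NLTK's VADER analyzer. The actual sentiment tool and positivity threshold used for EmoVid are not specified here, so treat this purely as an assumed setup.

```python
from collections import defaultdict
from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

def positive_caption_ratio(samples, threshold=0.05):
    """Fraction of positive-sentiment captions per emotion label.

    samples: iterable of (emotion_label, caption) pairs.
    A caption counts as positive when its VADER compound score exceeds
    the conventional 0.05 threshold (an assumption for this sketch).
    """
    sia = SentimentIntensityAnalyzer()
    pos, total = defaultdict(int), defaultdict(int)
    for emotion, caption in samples:
        total[emotion] += 1
        if sia.polarity_scores(caption)["compound"] > threshold:
            pos[emotion] += 1
    return {e: pos[e] / total[e] for e in total}
```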

BibTeX

@inproceedings{EmoVidAAAI2026,
  title     = {EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation},
  author    = {EmoVid Authors},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
  note      = {Oral},
  url       = {}
}