EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
Abstract
We introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for artistic media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perception across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. Experiments show significant improvements in both quantitative metrics and the visual quality of generated videos on text-to-video and image-to-video tasks.
Generation and Dataset Visualizations
Curated multimodal samples from EmoVid, combining representative dataset snippets and controllable custom-role generations.
📊 Dataset Sample
Tri-modal snippets (animation, movie, sticker) for every emotion class.
🎠 Custom Roles
Role-conditioned generations showcasing controllable styles.
Key Findings From EmoVid
Color and Brightness are Strong Emotional Indicators
Our analysis of the EmoVid dataset reveals a clear and quantifiable link between low-level visual attributes and high-level emotional categories. We found that positive-valence emotions (like contentment and amusement) are generally brighter and more colorful. Conversely, high-arousal emotions (like anger and excitement) also trend towards higher colorfulness but with lower brightness. This separation in color space is consistent with, and supports the robustness of, the valence-arousal model of emotion.
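As a minimal sketch of how such per-frame attributes could be computed, the snippet below uses Rec. 601 luma for brightness and the Hasler-Süsstrunk metric for colorfulness. These are common choices, assumed here for illustration; the paper's exact formulas are not specified on this page.

```python
import numpy as np

def brightness(frame: np.ndarray) -> float:
    """Mean Rec. 601 luma of an RGB frame of shape (H, W, 3), values in [0, 255]."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    return float(np.mean(0.299 * r + 0.587 * g + 0.114 * b))

def colorfulness(frame: np.ndarray) -> float:
    """Hasler-Suesstrunk colorfulness of an RGB frame: combines the spread and
    magnitude of the opponent-color components rg and yb."""
    r, g, b = (frame[..., i].astype(np.float64) for i in range(3))
    rg = r - g
    yb = 0.5 * (r + g) - b
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return float(std_root + 0.3 * mean_root)
```

Averaging these values over a clip's frames gives the per-video brightness and colorfulness statistics that the emotion-wise comparison above relies on.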
Fine-Tuning on EmoVid Decisively Wins User Preference
To validate the practical value of our dataset, we fine-tuned the Wan2.1 generative model on EmoVid. In a controlled perceptual user study, our fine-tuned model was overwhelmingly preferred over strong baselines (Wan-Original and CogVideoX). For Emotion Expression, our model was ranked first in 66.2% of comparisons. For Aesthetic Quality, it achieved a 57.9% Top-1 preference rate. This demonstrates EmoVid's clear utility in enhancing the emotional expressiveness of generative models.
Emotions Show Strong Temporal Persistence
A key advantage of a video dataset is the ability to study temporal dynamics. By analyzing consecutive movie clips, we built an emotion transition matrix. The analysis shows that emotions have strong self-persistence, visible as the matrix's pronounced diagonal. Furthermore, transitions are far more likely to occur within the same valence (e.g., from one negative emotion to another) than across valences (e.g., from positive to negative).
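A transition matrix of this kind can be sketched as follows: count label pairs from consecutive clips, then row-normalize so each row is the conditional distribution over the next emotion. The emotion list here is an illustrative subset, not the dataset's full label set.

```python
import numpy as np

# Illustrative subset of emotion labels; EmoVid's full label set may differ.
EMOTIONS = ["amusement", "excitement", "contentment", "anger", "fear", "sadness"]

def transition_matrix(sequence: list[str]) -> np.ndarray:
    """Row-normalized transition probabilities from an ordered sequence of
    per-clip emotion labels. Entry [i, j] = P(next emotion j | current emotion i)."""
    idx = {e: i for i, e in enumerate(EMOTIONS)}
    counts = np.zeros((len(EMOTIONS), len(EMOTIONS)))
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[idx[cur], idx[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing transitions stay all-zero instead of dividing by 0.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Self-persistence then shows up as large diagonal entries, and within-valence transitions as block structure among the positive and negative rows.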
Text Captions and Emotion Labels are Strongly Aligned
EmoVid is a multimodal dataset, and our analysis confirms its cross-modal consistency. We performed sentiment polarity analysis on the text captions for each video. The results show a clear pattern: videos labeled with positive emotions (e.g., amusement, excitement, contentment) have a significantly higher proportion of positive-sentiment captions, and the same holds true for negative emotions. This confirms the strong semantic link between the textual content and the perceived emotion.
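A check of this kind can be sketched with a simple lexicon-based polarity score aggregated per emotion label. The tiny word lists below are placeholders; the actual sentiment analysis tool used for EmoVid is not specified on this page.

```python
from collections import defaultdict

# Tiny illustrative lexicon, not the tool used in the paper.
POSITIVE = {"happy", "fun", "joy", "smile", "laugh", "love"}
NEGATIVE = {"sad", "angry", "scary", "cry", "dark", "hate"}

def caption_polarity(caption: str) -> int:
    """Crude polarity: +1 positive, -1 negative, 0 neutral/mixed."""
    words = caption.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)

def positive_ratio(records: list[tuple[str, str]]) -> dict[str, float]:
    """records: (emotion_label, caption) pairs.
    Returns, per emotion, the share of captions scored positive."""
    totals: dict[str, int] = defaultdict(int)
    positives: dict[str, int] = defaultdict(int)
    for emotion, caption in records:
        totals[emotion] += 1
        positives[emotion] += caption_polarity(caption) > 0
    return {e: positives[e] / totals[e] for e in totals}
```

Cross-modal consistency then corresponds to positive-labeled emotions having a markedly higher positive-caption share than negative-labeled ones.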
BibTeX
@inproceedings{EmoVidAAAI2026,
title = {EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation},
author = {EmoVid Authors},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2026},
note = {Oral},
url = {}
}