VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Runtao Liu¹*, Haoyu Wu¹,²*, Ziqiang Zheng¹, Chen Wei³, Yingqing He¹, Renjie Pi¹, Qifeng Chen¹

¹HKUST, ²Renmin University of China, ³Johns Hopkins University

* Equal Contribution. Work completed during Haoyu's internship at HKUST.

CVPR 2025

Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected.
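As a rough illustration of the data pipeline described above, the following Python sketch builds one preference pair per prompt from N scored samples and attaches a weight to each pair. It is a minimal sketch under stated assumptions, not the released implementation: the `ScoredVideo` fields, the equal-weight `omni_score` aggregation, the best-versus-worst pairing rule, and the linear `pair_weight` formula with its `strength` parameter are all hypothetical stand-ins chosen for readability.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ScoredVideo:
    """One generated sample with two hypothetical sub-scores in [0, 1]."""
    video_id: str
    quality: float   # assumed visual-quality score
    semantic: float  # assumed text-video alignment score


def omni_score(video: ScoredVideo) -> float:
    # Assumption: a plain average of the two dimensions. The actual
    # OmniScore aggregation may differ.
    return 0.5 * (video.quality + video.semantic)


def build_pair(samples: List[ScoredVideo]) -> Tuple[ScoredVideo, ScoredVideo, float]:
    """Return (preferred, rejected, score gap) for one prompt.

    Assumption: the pair is the highest- and lowest-scoring sample among
    the N candidates. Other pairing strategies are possible.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to form a preference pair")
    ranked = sorted(samples, key=omni_score)
    rejected, preferred = ranked[0], ranked[-1]
    return preferred, rejected, omni_score(preferred) - omni_score(rejected)


def pair_weight(gap: float, strength: float = 1.0) -> float:
    # Assumption: pairs with a larger score gap get a larger training
    # weight. This linear form is illustrative only.
    return 1.0 + strength * gap


def build_dataset(samples_by_prompt: Dict[str, List[ScoredVideo]],
                  strength: float = 1.0) -> List[dict]:
    """Collect one weighted preference pair per prompt."""
    dataset = []
    for prompt, samples in samples_by_prompt.items():
        preferred, rejected, gap = build_pair(samples)
        dataset.append({
            "prompt": prompt,
            "preferred": preferred.video_id,
            "rejected": rejected.video_id,
            "gap": gap,
            "weight": pair_weight(gap, strength),
        })
    return dataset


# Toy example with made-up scores for a single prompt.
example = {
    "a cat playing piano": [
        ScoredVideo("v0", quality=0.9, semantic=0.7),
        ScoredVideo("v1", quality=0.4, semantic=0.6),
        ScoredVideo("v2", quality=0.6, semantic=0.6),
    ]
}
pairs = build_dataset(example, strength=1.0)
```

In a training loop, the `weight` field would scale each pair's contribution to the preference loss, so that pairs are not all treated as equally informative.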

Results

Analysis of OmniScore on videos from VC2. (a) The difference between the maximum and minimum OmniScore among N videos as N increases. (b) Histogram of OmniScore. (c) Histogram of the difference in OmniScore between two samples in a preference pair. (d) Correlation heatmap of the OmniScore across dimensions.

VideoDPO alignment performance. We apply our proposed VideoDPO on three state-of-the-art open-source models and evaluate performance on VBench, HPS (V), and PickScore. After training with VideoDPO, all models achieve the best performance on VBench, with improvements also observed on HPS (V) or PickScore, demonstrating effectiveness.

Comparison of sub-dimension scores before and after alignment on VBench for VC2, T2V-Turbo, and CogVideo.

Ablation studies. We study strategies and configurations: (a) pair strategy, (b) filter strategy, (c) α values, (d) N values. Q = visual quality, S = semantic alignment.

Acknowledgement

Our work builds on the following open-source projects, and we sincerely thank their authors for their contributions:

VideoCrafter2, T2V-Turbo, CogVideoX, VideoTuna, VBench, VidProM.

Thanks to I Chieh Chen for valuable suggestions on demos.

Citation

@inproceedings{liu2025videodpo,
  title={{VideoDPO}: Omni-Preference Alignment for Video Diffusion Generation},
  author={Liu, Runtao and Wu, Haoyu and Zheng, Ziqiang and Wei, Chen and He, Yingqing and Pi, Renjie and Chen, Qifeng},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={8009--8019},
  year={2025}
}

Gallery

Before vs. after alignment samples (VideoCrafter2).

Before Alignment	After Alignment