🔥ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
Text-to-video (T2V) generation models, which offer convenient visual creation, have recently garnered increasing attention. Despite their substantial potential, the generated videos may present artifacts such as structural implausibility, temporal inconsistency, and a lack of motion, often resulting in near-static videos. In this work, we identify a correlation between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. Additionally, we observe that the energy contained within the temporal attention maps is directly related to the motion amplitude of the generated videos. Based on these observations, we present ByTheWay, a training-free method that improves the quality of text-to-video generation without introducing additional parameters or increasing memory usage or sampling time. Specifically, ByTheWay is composed of two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. 2) Fourier-based Motion Enhancement enhances the magnitude and richness of motion by amplifying the energy of the temporal attention map. Extensive experiments demonstrate that ByTheWay significantly improves the quality of text-to-video generation with negligible additional cost. Our code is available at ByTheWay Repo.
(1) A correlation is identified between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. Video generation processes exhibiting structurally implausible and temporally inconsistent artifacts demonstrate greater disparity between the temporal attention maps of different decoder blocks.
(2) The energy contained within the temporal attention maps is directly related to the motion amplitude of the generated videos. Specifically, videos that exhibit larger motion amplitude and a richer variety of motion patterns possess greater energy in their temporal attention maps.
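To make these two observations concrete, the snippet below sketches one plausible way to quantify them: the cross-block disparity as a Frobenius-norm gap between temporal attention maps, and the energy of a map as its total squared magnitude. Both definitions, as well as the tensor shapes, are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative metrics for the two observations above (assumed definitions,
# not the paper's exact formulation).
import torch

def attention_disparity(attn_prev: torch.Tensor, attn_curr: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm gap between the temporal attention maps of two decoder
    blocks; larger values correspond to greater cross-block disparity.
    Both tensors are assumed to have shape (batch*spatial, frames, frames)."""
    return torch.linalg.norm(attn_curr - attn_prev, dim=(-2, -1)).mean()

def attention_energy(attn: torch.Tensor) -> torch.Tensor:
    """Total squared magnitude of a temporal attention map, used here as a
    rough proxy for the 'energy' that correlates with motion amplitude."""
    return (attn ** 2).sum(dim=(-2, -1)).mean()

# Toy example with two random attention maps over 16 frames.
attn_a = torch.softmax(torch.randn(64, 16, 16), dim=-1)
attn_b = torch.softmax(torch.randn(64, 16, 16), dim=-1)
print(attention_disparity(attn_a, attn_b), attention_energy(attn_b))
```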
ByTheWay is composed of two principal components, Temporal Self-Guidance and Fourier-based Motion Enhancement, both of which operate on the temporal attention module within T2V models. Temporal Self-Guidance uses the temporal attention map from the preceding block to inform and regulate that of the current block, which effectively reduces the disparity between the temporal attention maps across various decoder blocks. Fourier-based Motion Enhancement modulates the high-frequency components of the temporal attention map, thereby amplifying the energy of the map and preventing the generation of videos that closely resemble static images.
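The following is a minimal, self-contained sketch of how the two operations could be realized on a generic temporal attention map. The blending formula, the row renormalization, the FFT axis, and the way τ selects frequency bins are illustrative assumptions; the official ByTheWay repository contains the authors' actual implementation.

```python
# Minimal sketch of the two ByTheWay operations on a temporal attention map of
# shape (batch*spatial, frames, frames). The blending formula, the row
# renormalization, the FFT axis, and the frequency-selection rule for tau are
# illustrative assumptions; see the official repository for the actual method.
import torch
import torch.fft as fft

def temporal_self_guidance(attn_curr: torch.Tensor,
                           attn_prev: torch.Tensor,
                           alpha: float = 0.6) -> torch.Tensor:
    """Infuse the preceding block's temporal attention map into the current one
    with ratio alpha, reducing cross-block disparity. Spatial alignment of the
    two maps (the blocks may differ in resolution) is omitted for brevity."""
    guided = (1.0 - alpha) * attn_curr + alpha * attn_prev
    # Keep each query frame's attention weights summing to one.
    return guided / guided.sum(dim=-1, keepdim=True).clamp_min(1e-8)

def fourier_motion_enhancement(attn: torch.Tensor,
                               beta: float = 1.5,
                               tau: int = 4) -> torch.Tensor:
    """Scale the tau highest temporal-frequency components of the attention map
    by beta, amplifying its energy and hence the motion it encodes."""
    spec = fft.fft(attn, dim=-2)                              # FFT over the frame axis
    freqs = fft.fftfreq(attn.shape[-2], device=attn.device)   # per-bin frequencies
    high = torch.argsort(freqs.abs(), descending=True)[:tau]  # tau highest-frequency bins
    spec[..., high, :] = spec[..., high, :] * beta
    return fft.ifft(spec, dim=-2).real
```

In an actual pipeline, such functions would be applied inside the temporal attention modules of consecutive decoder blocks during sampling, which is what keeps the method training-free and essentially cost-free.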
Qualitative comparisons with vanilla results. With the integration of ByTheWay, various T2V backbones (AnimateDiff and VideoCrafter2 here) demonstrate a notable improvement over their vanilla synthesis results.
Quantitative results of ByTheWay on VBench. ByTheWay yields the best performance across different T2V models.
Effect of α. α represents the infusion ratio of lower-level attention information in Temporal Self-Guidance. An appropriate α strengthens the temporal consistency of the video, but an excessively large α may lead to the loss of motion information.
Effect of β. β stands for the scaling factor of the high-frequency components in the temporal attention map within Fourier-based Motion Enhancement. An appropriate β introduces richer and more intense motion to the video, but an excessively large β may cause unexpected motion artifacts.
Effect of τ. τ denotes the number of discrete frequency components involved in Fourier-based Motion Enhancement. A larger τ allows for the manipulation of more frequency components that encode motion, thus promoting the motion enhancement effect.
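As a small standalone illustration of the role of τ, the snippet below lists which temporal-frequency bins would be scaled for a 16-frame attention map under the frequency-selection rule assumed in the sketch above; the selection rule and the frame count are assumptions, not the paper's exact criterion.

```python
# Standalone illustration of how tau selects frequency bins under the rule
# assumed in the sketch above, for a 16-frame attention map.
import torch

frames = 16  # assumed number of frames
freqs = torch.fft.fftfreq(frames)
for tau in (1, 2, 4, 8):
    high = torch.argsort(freqs.abs(), descending=True)[:tau]
    print(f"tau={tau}: scaled frequency bins -> {sorted(high.tolist())}")
```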
If you find this work helpful, please cite the following paper:
@inproceedings{bu2025bytheway,
  title={ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way},
  author={Bu, Jiazi and Ling, Pengyang and Zhang, Pan and Wu, Tong and Dong, Xiaoyi and Zang, Yuhang and Cao, Yuhang and Lin, Dahua and Wang, Jiaqi},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={12999--13008},
  year={2025}
}
Project page template is borrowed from FreeScale.