Performance on VBench-2.0

Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and the prohibitive quadratic cost of attention over long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, combining them effectively is challenging: training-free integration yields suboptimal results, while training sparse attention separately after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE (BLock-sparse Attention Meets step Distillation for Efficient video generation), a data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism that dynamically generates content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that incorporates sparsity directly into the distillation process rather than treating it as a separate compression step, and converges quickly. We validate BLADE on text-to-video models such as CogVideoX-5B and Wan2.1-1.3B, where it demonstrates substantial efficiency gains across scales. On Wan2.1-1.3B, BLADE achieves a 14.10× end-to-end inference acceleration over a 50-step baseline, and on models with shorter video sequences, such as CogVideoX-5B, it still delivers a robust 8.89× speedup. Crucially, the acceleration is accompanied by consistent quality improvements: on the VBench-2.0 benchmark, BLADE raises the score of CogVideoX-5B from 0.534 to 0.569 and of Wan2.1-1.3B from 0.563 to 0.570, results further corroborated by superior ratings in human evaluations.
Novel framework that synergistically integrates adaptive sparse attention directly into sparsity-aware step distillation, overcoming limitations of sequential approaches.
Dynamic, content-aware attention mechanism that generates hardware-friendly sparsity masks on-the-fly to focus computation on salient features.
Achieves remarkable speedups (14.10× on Wan2.1-1.3B, 8.89× on CogVideoX-5B) with consistent quality improvements on the VBench-2.0 benchmark.
Figure 1: The training mechanism of Video-BLADE within a single distillation interval [t_{i-1}, t_i). The Sparse Generator (G_θ) denoises the input x_{t_i} to produce the sample x_{t_{i-1}}. Crucially, this output is then re-corrupted with Gaussian noise to create an intermediate sample x_{t_j}. A dedicated Fake Score model evaluates this re-noised sample, and its output is contrasted with the score from the Real Score model (the pre-trained teacher model) to compute the Distribution Matching Loss (∇_θ D_KL). This loss directly updates the student generator, forcing it to align its generation trajectory with the teacher's at a distributional level.
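For concreteness, the following is a minimal PyTorch-style sketch of one such distillation update. It assumes a standard noise schedule; the names generator, real_score, fake_score, alphas, and sigmas are illustrative stand-ins rather than the actual implementation, and the alternating update of the Fake Score model is omitted.

```python
import torch
import torch.nn.functional as F

def distillation_step(generator, real_score, fake_score, x_ti, t_i, alphas, sigmas):
    """One sparsity-aware distillation update inside the interval [t_{i-1}, t_i).

    generator  : sparse student G_theta (with ASA layers)
    real_score : frozen pre-trained teacher (the "Real Score" model)
    fake_score : auxiliary "Fake Score" model tracking the student's distribution
    alphas, sigmas : 1-D tensors holding the noise-schedule coefficients
    """
    # 1) The student denoises x_{t_i} to produce the sample x_{t_{i-1}}.
    x_prev = generator(x_ti, t_i)

    # 2) Re-corrupt the output with Gaussian noise at a random intermediate timestep t_j.
    t_j = torch.randint(0, int(t_i), (x_prev.shape[0],), device=x_prev.device)
    noise = torch.randn_like(x_prev)
    a = alphas[t_j].view(-1, 1, 1, 1, 1)   # broadcast over (B, C, T, H, W) video latents
    s = sigmas[t_j].view(-1, 1, 1, 1, 1)
    x_tj = a * x_prev + s * noise

    # 3) Evaluate both score models on the re-noised sample.
    with torch.no_grad():
        score_real = real_score(x_tj, t_j)   # teacher score
        score_fake = fake_score(x_tj, t_j)   # fake score (trained in an alternating phase, not shown)

    # 4) Distribution-matching gradient: (score_fake - score_real) approximates ∇θ D_KL.
    #    The MSE-to-detached-target trick injects exactly this gradient into the generator.
    grad = score_fake - score_real
    loss = 0.5 * F.mse_loss(x_tj, (x_tj - grad).detach())
    return loss
```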
Our framework is based on a student-teacher paradigm where the teacher is a pre-trained, high-quality multi-step diffusion model, and the student initially shares the same architecture but with standard self-attention layers replaced by our Adaptive Block-Sparse Attention (ASA) mechanism.
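A rough sketch of this student initialization is shown below, assuming a PyTorch model; nn.MultiheadAttention is only a stand-in for the actual DiT self-attention class, and make_asa is a hypothetical factory that builds an ASA layer from an existing attention layer.

```python
import copy
import torch.nn as nn

def build_sparse_student(teacher: nn.Module, make_asa) -> nn.Module:
    """Copy the teacher and swap every self-attention layer for an ASA layer."""
    student = copy.deepcopy(teacher)          # the student inherits the teacher's weights
    for module in student.modules():
        for name, child in list(module.named_children()):
            # nn.MultiheadAttention stands in for the model's actual attention class.
            if isinstance(child, nn.MultiheadAttention):
                setattr(module, name, make_asa(child))
    return student
```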
Figure 2: The two-stage process for Adaptive Block-Sparse Attention mask generation. (1) The Efficient Attention Prober samples a few representative tokens (e.g., k=16) from each block to compute a low-cost max-pooled attention matrix P. (2) The Threshold-based Mask Generator sorts the scores in P and selects the highest-scoring blocks whose cumulative score reaches a specified threshold (e.g., 95% of the attention mass), producing the final binary mask M. To enrich the context for training, we augment the key matrix K by concatenating it with a pooled version: K' = Concat(K, MeanPool_n(K)), where MeanPool_n(K) denotes mean pooling over a window of size n. During attention computation, the original K region uses the binary block mask M, while the pooled region receives a fixed additive mask of ln(n), softly guiding attention without disrupting sparsity.
Adaptive Block-Sparse Attention (ASA) is a two-stage sparse attention mechanism that dynamically selects important query-key blocks during each forward pass. It uses a lightweight attention prober followed by threshold-based mask generation, significantly reducing computational cost while preserving the most relevant query-key interactions.
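The two stages of Figure 2 can be sketched roughly as follows. This is an illustrative re-implementation under simplifying assumptions (sequence length divisible by the block size, randomly sampled probe tokens), not the optimized kernel.

```python
import torch

def asa_block_mask(q, k, block_size=64, probe_k=16, threshold=0.95):
    """Sketch of ASA mask generation. q, k: (batch*heads, seq_len, dim).

    Returns a boolean (batch*heads, num_blocks, num_blocks) mask M where True
    means "compute this query-block/key-block pair".
    """
    bh, n, d = q.shape
    nb = n // block_size                      # assume seq_len divisible by block_size

    # Stage 1: Efficient Attention Prober. Sample probe_k tokens per block and
    # max-pool their token-level scores into a cheap block-level matrix P.
    idx = torch.randint(0, block_size, (nb, probe_k), device=q.device)
    idx = idx + torch.arange(nb, device=q.device).unsqueeze(1) * block_size
    q_probe = q[:, idx.reshape(-1), :].view(bh, nb, probe_k, d)
    k_probe = k[:, idx.reshape(-1), :].view(bh, nb, probe_k, d)
    scores = torch.einsum('bipd,bjqd->bijpq', q_probe, k_probe) / d ** 0.5
    p = scores.amax(dim=(-1, -2)).softmax(dim=-1)   # (bh, nb, nb), rows sum to 1

    # Stage 2: Threshold-based Mask Generator. Per query block, keep the smallest
    # set of key blocks whose cumulative score reaches the threshold (e.g. 95%).
    sorted_p, order = p.sort(dim=-1, descending=True)
    keep_sorted = sorted_p.cumsum(dim=-1) - sorted_p < threshold   # include the crossing block
    mask = torch.zeros_like(p).scatter_(-1, order, keep_sorted.float()).bool()
    return mask
```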
ASA GT extends ASA by introducing a global context preservation mechanism. It appends globally pooled tokens to the key sequence and applies a fixed additive mask, ensuring semantic consistency even under extreme sparsity.
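A minimal sketch of this key augmentation is given below; pooling the values alongside the keys and the exact tensor layout are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def augment_keys_with_global_tokens(k, v, n=16):
    """Append mean-pooled global tokens to K (and, by assumption, V) and build
    the additive attention bias for the pooled region. k, v: (batch*heads, seq_len, dim)."""
    bh, s, d = k.shape

    # Mean-pool over non-overlapping windows of size n: MeanPool_n(K).
    k_pool = F.avg_pool1d(k.transpose(1, 2), kernel_size=n, stride=n).transpose(1, 2)
    v_pool = F.avg_pool1d(v.transpose(1, 2), kernel_size=n, stride=n).transpose(1, 2)

    k_aug = torch.cat([k, k_pool], dim=1)     # K' = Concat(K, MeanPool_n(K))
    v_aug = torch.cat([v, v_pool], dim=1)     # assumption: values are pooled the same way

    # Additive bias on the attention logits: 0 over the original keys (they keep the
    # binary block mask M), +ln(n) over the pooled keys to softly add global context.
    bias = torch.cat([torch.zeros(s, device=k.device),
                      torch.full((k_pool.shape[1],), math.log(n), device=k.device)])
    return k_aug, v_aug, bias
```

Adding ln(n) to a pooled key's logit multiplies its softmax weight by n, so each pooled token roughly carries the weight of the n tokens it summarizes without otherwise disturbing the sparse attention pattern.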
Table 1: Video Quality Evaluation on VBench-2.0. BLADE achieves superior quality while maintaining significant speedup across different video generation models.
Table 2: Efficiency analysis on Wan2.1-1.3B (tested on an NVIDIA H20). Our ASA kernel achieves a 3.30× attention speedup and a 14.10× end-to-end acceleration.
Table 3: Comparison of training-free sparse attention methods on Wan2.1-1.3B (8-step distilled model). ASA significantly outperforms existing methods in both PSNR and SSIM.