
Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

Youping Gu1*  Xiaolong Li1*  Yuhao Hu2  Bohan Zhuang1
1Zhejiang University
2Central Media Technology Institute, Huawei Technologies
*Equal contribution

Abstract

Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and the prohibitive quadratic cost of attention over long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining them presents critical challenges: training-free integration yields suboptimal results, while training sparse attention separately after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE (BLock-sparse Attention Meets step Distillation for Efficient video generation), a data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism that dynamically generates content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that incorporates sparsity directly into the distillation process, rather than treating it as a separate compression step, and converges quickly. We validate BLADE on text-to-video models such as CogVideoX-5B and Wan2.1-1.3B, where it demonstrates substantial efficiency gains across model scales. On Wan2.1-1.3B, BLADE achieves a 14.10× end-to-end inference acceleration over a 50-step baseline, and on models such as CogVideoX-5B with shorter video sequences it still delivers a robust 8.89× speedup. Crucially, the acceleration is accompanied by a consistent quality improvement: on the VBench-2.0 benchmark, BLADE raises the score of CogVideoX-5B from 0.534 to 0.569 and that of Wan2.1-1.3B from 0.563 to 0.570, results further corroborated by superior ratings in human evaluations.

Highlights: 14.10× speedup on Wan2.1-1.3B · 8.89× speedup on CogVideoX-5B · 0.569 VBench-2.0 score on CogVideoX-5B

Key Contributions

Data-Free Joint Training

Novel framework that synergistically integrates adaptive sparse attention directly into sparsity-aware step distillation, overcoming limitations of sequential approaches.

Adaptive Block-Sparse Attention

Dynamic, content-aware attention mechanism that generates hardware-friendly sparsity masks on-the-fly to focus computation on salient features.

Significant Acceleration

Achieves remarkable speedups (14.10× on Wan2.1-1.3B, 8.89× on CogVideoX-5B) with consistent quality improvements on VBench-2.0 benchmarks.

Method Overview

BLADE Framework Overview

Figure 1: The training mechanism of Video-BLADE within a single distillation interval [t_{i-1}, t_i). The Sparse Generator G_θ denoises the input x_{t_i} to produce the sample x_{t_{i-1}}. Crucially, this output is then re-corrupted with Gaussian noise to create an intermediate sample x_{t_j}. A dedicated Fake Score model evaluates this re-noised sample, and its output is contrasted with the score from the Real Score model (the pre-trained teacher) to compute the Distribution Matching Loss (∇_θ D_KL). This loss directly updates the student generator, forcing it to align its generation trajectory with the teacher's at the distribution level.

BLADE Framework

Our framework is based on a student-teacher paradigm where the teacher is a pre-trained, high-quality multi-step diffusion model, and the student initially shares the same architecture but with standard self-attention layers replaced by our Adaptive Block-Sparse Attention (ASA) mechanism.

Key Components:

  • The pre-trained teacher model f_φ, which provides the real data score s_φ.
  • The student generator G_θ, which learns to produce high-fidelity samples in a few steps.
  • A fake score model f_ψ, which provides the fake score s_ψ by approximating the student's intractable sample score (see the sketch below).
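
To make this concrete, below is a minimal PyTorch-style sketch of one distillation update in the spirit of the DMD/TDM distribution matching described in Figure 1. The function signatures, the linear re-noising schedule, and the MSE surrogate for the KL gradient are our illustrative assumptions, not the released training code:

```python
import torch
import torch.nn.functional as F

def distillation_step(G_theta, f_phi, f_psi, x_ti, t_i, t_im1, cond):
    """One sparsity-aware distillation update on the interval [t_{i-1}, t_i).

    G_theta: sparse student generator; f_phi: frozen teacher (real score);
    f_psi: fake score model. Signatures and schedule are assumptions.
    """
    # 1. The sparse student denoises x_{t_i} into x_{t_{i-1}}.
    x_prev = G_theta(x_ti, t_i, cond)

    # 2. Re-corrupt the student sample with Gaussian noise at a random
    #    intermediate time t_j (assumed linear interpolation schedule).
    t_j = torch.rand(x_prev.shape[0], device=x_prev.device) * t_im1
    w = t_j.view(-1, 1, 1, 1, 1)            # broadcast over (C, T, H, W)
    x_tj = (1.0 - w) * x_prev + w * torch.randn_like(x_prev)

    # 3. Score the re-noised sample with the frozen teacher (real score
    #    s_phi) and the fake score model (fake score s_psi).
    with torch.no_grad():
        s_real = f_phi(x_tj, t_j, cond)
        s_fake = f_psi(x_tj, t_j, cond)

    # 4. DMD-style surrogate for the KL gradient: push x_tj along
    #    (s_fake - s_real); gradients reach G_theta through x_tj.
    target = (x_tj - (s_fake - s_real)).detach()
    return 0.5 * F.mse_loss(x_tj, target)
```

In this style of training, f_ψ is updated in alternation with G_θ (typically with a standard denoising loss on generator samples) so that s_ψ keeps tracking the student's evolving output distribution.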
Adaptive Block-Sparse Attention Mechanism

Figure 2: The two-stage process for Adaptive Block-Sparse Attention mask generation. (1) The Efficient Attention Prober samples a few representative tokens (e.g., k=16) from each block to compute a low-cost, max-pooled block-level attention matrix P. (2) The Threshold-based Mask Generator sorts the scores in P and selects the fewest top-scoring blocks whose cumulative score reaches a specified threshold (e.g., 95%), producing the final binary mask M. To enrich the context during training, we augment the key matrix by concatenating it with a pooled version: K' = Concat(K, MeanPool_n(K)), where MeanPool_n(K) denotes mean pooling over a window of size n. During attention computation, the original K region uses the binary block mask M, while the pooled region receives a fixed additive mask of ln(n), softly guiding attention without disrupting sparsity.
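
The following PyTorch sketch illustrates the two stages under stated assumptions: block size 64, k=16 probe tokens chosen at random per block (the sampling strategy is not specified here), and a softmax normalization of P so the 95% threshold can be read as cumulative attention mass. The kernel used in practice operates at block granularity on the GPU; this is only a dense reference:

```python
import torch

def asa_block_mask(q, k, block=64, probe=16, thresh=0.95):
    """Adaptive block-sparse mask sketch. q, k: [B, H, N, D], N % block == 0.

    Returns a boolean block mask M of shape [B, H, N//block, N//block].
    """
    B, H, N, D = q.shape
    nb = N // block

    # Stage 1: Efficient Attention Prober. Sample `probe` representative
    # tokens per block and max-pool their attention to block granularity.
    idx = torch.randperm(block, device=q.device)[:probe]
    qp = q.view(B, H, nb, block, D)[:, :, :, idx]      # [B, H, nb, probe, D]
    kp = k.view(B, H, nb, block, D)[:, :, :, idx]
    scores = torch.einsum('bhipd,bhjqd->bhijpq', qp, kp) / D ** 0.5
    P = scores.amax(dim=(-2, -1))                      # [B, H, nb, nb]

    # Stage 2: Threshold-based Mask Generator. Normalize each query-block
    # row, sort, and keep blocks until the cumulative mass reaches `thresh`
    # (inclusive of the block that crosses the threshold).
    P = P.softmax(dim=-1)
    vals, order = P.sort(dim=-1, descending=True)
    keep = (vals.cumsum(dim=-1) - vals) < thresh
    M = torch.zeros_like(P).scatter_(-1, order, keep.float()).bool()
    return M
```

Because only `probe` of every `block` tokens participate, the probe attention touches roughly (probe/block)² of the full score matrix, 1/16 of it with these defaults, so mask generation stays cheap relative to the attention it prunes.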

Adaptive Block-Sparse Attention (ASA and ASA GT)

Adaptive Block-Sparse Attention (ASA) is a two-stage sparse attention mechanism that dynamically selects important query-key blocks during each forward pass. A lightweight attention prober first estimates block-level importance, and a threshold-based mask generator then keeps only the top-scoring blocks, significantly reducing computational cost while retaining the query-key interactions that matter most.

ASA GT extends ASA by introducing a global context preservation mechanism. It appends globally pooled tokens to the key sequence and applies a fixed additive mask, ensuring semantic consistency even under extreme sparsity.
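
A dense reference sketch of this augmentation, assuming a pooling window n=16 and a block mask M such as the one produced above; all names are ours, and a production kernel would exploit the block structure instead of materializing a dense bias:

```python
import math
import torch
import torch.nn.functional as F

def asa_gt_attention(q, k, v, M, block=64, n=16):
    """ASA with global-token augmentation. q, k, v: [B, H, N, D];
    M: boolean block mask of shape [B, H, N//block, N//block].
    """
    B, H, N, D = k.shape

    # Append mean-pooled keys/values (window n) as global context tokens:
    # K' = Concat(K, MeanPool_n(K)), and likewise for V.
    k_pool = F.avg_pool1d(k.flatten(0, 1).transpose(1, 2), n)
    k_pool = k_pool.transpose(1, 2).view(B, H, -1, D)
    v_pool = F.avg_pool1d(v.flatten(0, 1).transpose(1, 2), n)
    v_pool = v_pool.transpose(1, 2).view(B, H, -1, D)
    k_aug = torch.cat([k, k_pool], dim=2)
    v_aug = torch.cat([v, v_pool], dim=2)

    # Additive bias: -inf outside the kept blocks for the original keys,
    # a fixed +ln(n) on the pooled keys (each stands in for n tokens).
    dense_M = M.repeat_interleave(block, -1).repeat_interleave(block, -2)
    bias_local = torch.zeros(B, H, N, N, device=k.device)
    bias_local.masked_fill_(~dense_M, float('-inf'))
    bias_pool = torch.full((B, H, N, k_pool.shape[2]), math.log(n),
                           device=k.device)
    bias = torch.cat([bias_local, bias_pool], dim=-1)

    attn = (q @ k_aug.transpose(-2, -1)) / D ** 0.5 + bias
    return attn.softmax(dim=-1) @ v_aug
```

The +ln(n) bias lets each pooled key contribute to the softmax roughly as if its n constituent tokens were present, preserving global context without touching the sparse local mask.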

Key Features:

  • Online Dynamic Block Masking: Efficient and reliable mask generation at each step without supervision
  • High Sparsity Support: Achieves up to 80% sparsity with no performance degradation
  • Global Context (ASA GT only): Global tokens with additive masking improve semantic preservation, especially in generative tasks
  • Training Compatibility: Supports end-to-end training and integrates well with distillation-based optimization (see the combined usage sketch after this list)
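
Putting the two sketches above together, a hypothetical forward pass could look like the following (shapes are arbitrary; asa_block_mask and asa_gt_attention are the illustrative functions defined above):

```python
import torch

# Compose the illustrative sketches above: probe-based mask, then ASA GT.
B, H, N, D = 1, 12, 4096, 64
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))

M = asa_block_mask(q, k, block=64, probe=16, thresh=0.95)
out = asa_gt_attention(q, k, v, M, block=64, n=16)  # [B, H, N, D]
print(out.shape, 1.0 - M.float().mean().item())     # achieved block sparsity
```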

Experimental Results

Performance on VBench-2.0


Table 1: Video Quality Evaluation on VBench-2.0. BLADE achieves superior quality while maintaining significant speedup across different video generation models.

Efficiency Analysis


Table 2: Efficiency analysis on Wan2.1-1.3B (tested on an NVIDIA H20 GPU). Our ASA kernel alone achieves a 3.30× attention speedup, contributing to the 14.10× end-to-end acceleration.

Sparse Attention Comparison


Table 3: Comparison of training-free sparse attention methods on Wan2.1-1.3B (8-step distilled model). ASA significantly outperforms existing methods in both PSNR and SSIM.

Video Demonstrations

Left: Baseline (50 steps) · Right: BLADE (8 steps)

A garden comes to life as butterflies flutter amidst blossoms


A small boy sprints through torrential downpour with lightning


Astronaut shakes hands with alien being on Mars


Elderly gentleman paints oil artwork at water's edge


Mature man ponders thoughtfully in dimly lit bar

Left: Baseline (50 steps) · Right: BLADE (8 steps)

One person opens the door for another person


The camera orbits around. Taj Mahal, the camera circles around


A cheetah with the shell of a tortoise, moving quickly but with a protective outer layer


A woman is playing ping-pong


The camera orbits around. Birdhouse, the camera circles around
