Performance on VBench-2.0

Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and the prohibitive quadratic cost of attention over long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, combining them effectively is challenging: training-free integration yields suboptimal results, while training sparse attention separately after step distillation requires prohibitively expensive high-quality video data. To overcome these limitations, we propose BLADE (BLock-sparse Attention Meets step Distillation for Efficient video generation), a data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism that dynamically generates content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that incorporates sparsity directly into the distillation process rather than treating it as a separate compression step, and converges quickly. We validate BLADE on text-to-video models such as CogVideoX-5B and Wan2.1-1.3B, where it demonstrates substantial efficiency gains across scales. On Wan2.1-1.3B, BLADE achieves a 14.10× end-to-end inference acceleration over a 50-step baseline, and on models with shorter video sequences, such as CogVideoX-5B, it still delivers a robust 8.89× speedup. Crucially, the acceleration is accompanied by consistent quality improvements: on the VBench-2.0 benchmark, BLADE raises the score of CogVideoX-5B from 0.534 to 0.569 and of Wan2.1-1.3B from 0.563 to 0.570, results further corroborated by superior ratings in human evaluations.
Novel framework that synergistically integrates adaptive sparse attention directly into sparsity-aware step distillation, overcoming limitations of sequential approaches.
Dynamic, content-aware attention mechanism that generates hardware-friendly sparsity masks on-the-fly to focus computation on salient features.
Achieves remarkable speedups (14.10× on Wan2.1-1.3B, 8.89× on CogVideoX-5B) with consistent quality improvements on the VBench-2.0 benchmark.
Figure 1: The training mechanism of Video-BLADE within a single distillation interval [t_{i-1}, t_i). The Sparse Generator (G_θ) denoises the input x_{t_i} to produce the sample x_{t_{i-1}}. Crucially, this output is then re-corrupted with Gaussian noise to create an intermediate sample x_{t_j}. A dedicated Fake Score model evaluates this re-noised sample, and its output is contrasted with the score from the Real Score model (the pre-trained teacher model) to compute the Distribution Matching Loss (∇_θ D_KL). This loss directly updates the student generator, forcing it to align its generation trajectory with the teacher's at a distributional level.
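For concreteness, the following is a minimal PyTorch-style sketch of one such distillation update. It assumes a standard noise schedule; the names generator, real_score, fake_score, alphas, and sigmas are illustrative stand-ins rather than the actual implementation, and the alternating update of the Fake Score model is omitted.

```python
import torch
import torch.nn.functional as F

def distillation_step(generator, real_score, fake_score, x_ti, t_i, alphas, sigmas):
    """One sparsity-aware distillation update inside the interval [t_{i-1}, t_i).

    generator  : sparse student G_theta (with ASA layers)
    real_score : frozen pre-trained teacher (the "Real Score" model)
    fake_score : auxiliary "Fake Score" model tracking the student's distribution
    alphas, sigmas : 1-D tensors holding the noise-schedule coefficients
    """
    # 1) The student denoises x_{t_i} to produce the sample x_{t_{i-1}}.
    x_prev = generator(x_ti, t_i)

    # 2) Re-corrupt the output with Gaussian noise at a random intermediate timestep t_j.
    t_j = torch.randint(0, int(t_i), (x_prev.shape[0],), device=x_prev.device)
    noise = torch.randn_like(x_prev)
    a = alphas[t_j].view(-1, 1, 1, 1, 1)   # broadcast over (B, C, T, H, W) video latents
    s = sigmas[t_j].view(-1, 1, 1, 1, 1)
    x_tj = a * x_prev + s * noise

    # 3) Evaluate both score models on the re-noised sample.
    with torch.no_grad():
        score_real = real_score(x_tj, t_j)   # teacher score
        score_fake = fake_score(x_tj, t_j)   # fake score (trained in an alternating phase, not shown)

    # 4) Distribution-matching gradient: (score_fake - score_real) approximates ∇θ D_KL.
    #    The MSE-to-detached-target trick injects exactly this gradient into the generator.
    grad = score_fake - score_real
    loss = 0.5 * F.mse_loss(x_tj, (x_tj - grad).detach())
    return loss
```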
Our framework is based on a student-teacher paradigm where the teacher is a pre-trained, high-quality multi-step diffusion model, and the student initially shares the same architecture but with standard self-attention layers replaced by our Adaptive Block-Sparse Attention (ASA) mechanism.
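A rough sketch of this student initialization is shown below, assuming a PyTorch model; nn.MultiheadAttention is only a stand-in for the actual DiT self-attention class, and make_asa is a hypothetical factory that builds an ASA layer from an existing attention layer.

```python
import copy
import torch.nn as nn

def build_sparse_student(teacher: nn.Module, make_asa) -> nn.Module:
    """Copy the teacher and swap every self-attention layer for an ASA layer."""
    student = copy.deepcopy(teacher)          # the student inherits the teacher's weights
    for module in student.modules():
        for name, child in list(module.named_children()):
            # nn.MultiheadAttention stands in for the model's actual attention class.
            if isinstance(child, nn.MultiheadAttention):
                setattr(module, name, make_asa(child))
    return student
```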
Figure 2: The two-stage process for Adaptive Block-Sparse Attention mask generation. (1) The Efficient Attention Prober samples a few representative tokens (e.g., k=16) from each block to compute a low-cost max-pooled attention matrix P. (2) The Threshold-based Mask Generator sorts the scores in P and selects the highest-scoring blocks whose cumulative score reaches a specified threshold (e.g., 95% of the attention mass), producing the final binary mask M. To enrich the context for training, we augment the key matrix K by concatenating it with a pooled version: K' = Concat(K, MeanPool_n(K)), where MeanPool_n(K) denotes mean pooling over a window of size n. During attention computation, the original K region uses the binary block mask M, while the pooled region receives a fixed additive mask of ln(n), softly guiding attention without disrupting sparsity.
Adaptive Block-Sparse Attention (ASA) is a two-stage sparse attention mechanism that dynamically selects important query-key blocks during each forward pass. It uses a lightweight attention prober followed by threshold-based mask generation, significantly reducing computational cost while preserving the most relevant query-key interactions.
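The two stages of Figure 2 can be sketched roughly as follows. This is an illustrative re-implementation under simplifying assumptions (sequence length divisible by the block size, randomly sampled probe tokens), not the optimized kernel.

```python
import torch

def asa_block_mask(q, k, block_size=64, probe_k=16, threshold=0.95):
    """Sketch of ASA mask generation. q, k: (batch*heads, seq_len, dim).

    Returns a boolean (batch*heads, num_blocks, num_blocks) mask M where True
    means "compute this query-block/key-block pair".
    """
    bh, n, d = q.shape
    nb = n // block_size                      # assume seq_len divisible by block_size

    # Stage 1: Efficient Attention Prober. Sample probe_k tokens per block and
    # max-pool their token-level scores into a cheap block-level matrix P.
    idx = torch.randint(0, block_size, (nb, probe_k), device=q.device)
    idx = idx + torch.arange(nb, device=q.device).unsqueeze(1) * block_size
    q_probe = q[:, idx.reshape(-1), :].view(bh, nb, probe_k, d)
    k_probe = k[:, idx.reshape(-1), :].view(bh, nb, probe_k, d)
    scores = torch.einsum('bipd,bjqd->bijpq', q_probe, k_probe) / d ** 0.5
    p = scores.amax(dim=(-1, -2)).softmax(dim=-1)   # (bh, nb, nb), rows sum to 1

    # Stage 2: Threshold-based Mask Generator. Per query block, keep the smallest
    # set of key blocks whose cumulative score reaches the threshold (e.g. 95%).
    sorted_p, order = p.sort(dim=-1, descending=True)
    keep_sorted = sorted_p.cumsum(dim=-1) - sorted_p < threshold   # include the crossing block
    mask = torch.zeros_like(p).scatter_(-1, order, keep_sorted.float()).bool()
    return mask
```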
ASA GT extends ASA by introducing a global context preservation mechanism. It appends globally pooled tokens to the key sequence and applies a fixed additive mask, ensuring semantic consistency even under extreme sparsity.
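A minimal sketch of this key augmentation is given below; pooling the values alongside the keys and the exact tensor layout are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def augment_keys_with_global_tokens(k, v, n=16):
    """Append mean-pooled global tokens to K (and, by assumption, V) and build
    the additive attention bias for the pooled region. k, v: (batch*heads, seq_len, dim)."""
    bh, s, d = k.shape

    # Mean-pool over non-overlapping windows of size n: MeanPool_n(K).
    k_pool = F.avg_pool1d(k.transpose(1, 2), kernel_size=n, stride=n).transpose(1, 2)
    v_pool = F.avg_pool1d(v.transpose(1, 2), kernel_size=n, stride=n).transpose(1, 2)

    k_aug = torch.cat([k, k_pool], dim=1)     # K' = Concat(K, MeanPool_n(K))
    v_aug = torch.cat([v, v_pool], dim=1)     # assumption: values are pooled the same way

    # Additive bias on the attention logits: 0 over the original keys (they keep the
    # binary block mask M), +ln(n) over the pooled keys to softly add global context.
    bias = torch.cat([torch.zeros(s, device=k.device),
                      torch.full((k_pool.shape[1],), math.log(n), device=k.device)])
    return k_aug, v_aug, bias
```

Adding ln(n) to a pooled key's logit multiplies its softmax weight by n, so each pooled token roughly carries the weight of the n tokens it summarizes without otherwise disturbing the sparse attention pattern.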
Table 1: Video Quality Evaluation on VBench-2.0. BLADE achieves superior quality while maintaining significant speedup across different video generation models.
Table 2: Efficiency analysis on Wan2.1-1.3B (tested on an NVIDIA H20). Our ASA kernel achieves a 3.30× attention speedup and a 14.10× end-to-end acceleration.
Table 3: Comparison of training-free sparse attention methods on Wan2.1-1.3B (8-step distilled model). ASA significantly outperforms existing methods in both PSNR and SSIM.