STAC: A spatio-temporal transformer with adaptive context for video compression

Gallena Watthage, Reka Sandaruwan and Fernando, Anil (2026) STAC: A spatio-temporal transformer with adaptive context for video compression. Applied Sciences, 16 (9). 4568. ISSN 2076-3417 (https://doi.org/10.3390/app16094568)

Text. Filename: Watthage-Fernando-Applsci-2026-STAC-A-spatio-temporal-transformer-with-adaptive-context-for-video-compression.pdf
Final Published Version
License: Creative Commons Attribution 4.0


Abstract

The rapid growth of video content demands compression solutions more effective than traditional codecs. Although neural video compression has advanced considerably, current methods struggle to model long-range temporal dependencies effectively and to adapt to varying content properties. We introduce STAC (Spatio-Temporal Adaptive Context), a transformer-based neural video compression scheme that addresses these limitations through three original contributions. First, the Adaptive Context Selector (ACS) dynamically evaluates and selects the most informative reference frames using learned relevance scoring, replacing the traditional use of predetermined adjacent frame sets. Second, Enhanced Sliding Window Attention (ESWA) models spatio-temporal correlations efficiently by integrating learnable local bias and temporal gating into an attention mechanism whose computational cost is adjustable. Third, a dual-path entropy model uses an adaptively learned fusion gate to combine channel-wise autoregressive prediction with spatio-temporal prediction, producing better probability estimates for entropy coding. STAC is trained on the Vimeo-90k dataset using a four-phase curriculum with the Adam optimiser over approximately 2.2 M total steps. We tested STAC on six benchmark datasets (UVG, MCL-JCV, and HEVC Classes B, C, D and E) under varying test settings. The experimental results show that STAC achieves an average BD-rate saving of 32.20% in the YUV colourspace with an intra-period of −1. The consistent improvement across both PSNR and MS-SSIM metrics confirms that STAC’s coding gains arise from genuinely improved probability modelling rather than metric-specific optimisation.
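As a rough illustration of two ideas named above, the sketch below shows a top-k selection of reference frames by relevance score and an element-wise gated blend of two predictions. It is only a toy, written under loud assumptions: the function names `select_context` and `fuse`, the fixed score and gate values, and the sigmoid gate are all hypothetical, and in STAC both the relevance scores and the fusion gate are learned by the network rather than supplied by hand.

```python
import math


def select_context(scores, k):
    """Return the indices of the k highest-scoring candidate frames.

    Hypothetical stand-in for relevance-based reference selection:
    the scores here are plain numbers, not learned values.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order among the selected frames.
    return sorted(ranked[:k])


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(channel_pred, context_pred, gate_logits):
    """Blend two predictions element-wise with a gate weight in (0, 1).

    Assumption: a sigmoid gate w gives w * channel + (1 - w) * context.
    The abstract does not specify the form of the fusion gate.
    """
    fused = []
    for c, s, g in zip(channel_pred, context_pred, gate_logits):
        w = sigmoid(g)
        fused.append(w * c + (1.0 - w) * s)
    return fused


# Made-up relevance scores for five candidate reference frames.
scores = [0.10, 0.72, 0.05, 0.64, 0.33]
selected = select_context(scores, k=2)

# Made-up predictions from the two entropy-model paths; a gate logit of 0
# weights both paths equally.
fused = fuse([1.0, 2.0], [3.0, 6.0], [0.0, 0.0])

print(selected)
print(fused)
```

With a gate logit of zero the blend reduces to a plain average of the two paths; large positive or negative logits push the result towards one path or the other.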
Evaluations were performed on six standard benchmarks (UVG, MCL-JCV, and HEVC Classes B, C, D, and E) under 24 experimental configurations (six datasets × 2 colourspaces × 2 intra-period settings), with all methods tested under identical conditions using the same sequences, the same frames (96 per sequence), and the same VTM-17.0 anchor codec. Under YUV with IP = −1, STAC achieves an average BD-rate saving of 32.20% over VTM, outperforming the prior state-of-the-art DCMVC by 2.70 percentage points (pp). Under IP = 32, STAC achieves −27.01%, a degradation of only 5.19 pp versus 6.42 pp for DCMVC. The results generalise to the RGB colourspace (−31.23%) and hold across resolutions from 240p (−35.19%) to 4K (−36.35%).
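The configuration count and the percentage-point figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a sanity check: the dataset labels are shorthand chosen here, and the DCMVC value for IP = −1 is derived from the quoted 2.70 pp margin rather than stated in the abstract.

```python
from itertools import product

# Shorthand labels for the six benchmarks named in the abstract.
datasets = ["UVG", "MCL-JCV", "HEVC-B", "HEVC-C", "HEVC-D", "HEVC-E"]
colourspaces = ["YUV", "RGB"]
intra_periods = [-1, 32]

# Every combination of dataset, colourspace, and intra-period setting.
configs = list(product(datasets, colourspaces, intra_periods))

# Average BD-rates over VTM in YUV, as quoted (negative means bitrate saved).
stac_ip_minus1 = -32.20
stac_ip32 = -27.01

# Loss in saving when moving from IP = -1 to IP = 32, in percentage points.
stac_degradation_pp = round(stac_ip32 - stac_ip_minus1, 2)

# Derived, not quoted: DCMVC trails STAC by 2.70 pp under IP = -1.
dcmvc_ip_minus1 = round(stac_ip_minus1 + 2.70, 2)

print(len(configs))
print(stac_degradation_pp)
print(dcmvc_ip_minus1)
```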

ORCID iDs

Gallena Watthage, Reka Sandaruwan and Fernando, Anil ORCID: https://orcid.org/0000-0002-2158-2367