ImmerseFM-3D : a foundation model framework for generalizable 360-degree video streaming with cross-modal scene understanding

Gallena Watthage, Reka S. and Fernando, Anil (2026) ImmerseFM-3D : a foundation model framework for generalizable 360-degree video streaming with cross-modal scene understanding. Applied Sciences, 16 (7). 3424. ISSN 2076-3417 (https://doi.org/10.3390/app16073424)

Text. Filename: Watthage-Fernando-AS-2026-ImmerseFM-3D-a-foundation-model-framework-for-generalizable-360-degree-video-streaming.pdf
Final Published Version
License: Creative Commons Attribution 4.0

Abstract

Current 360-degree video streaming systems treat viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience (QoE) estimation as independent tasks, yielding fragmented pipelines that generalize poorly across content types, network conditions, and individual users. We propose ImmerseFM-3D, a foundation model that jointly solves all four sub-tasks through a single shared representation. Seven input modalities, namely video frames, network traces, head-motion trajectories, ambisonics audio, depth maps, eye-tracking signals, and CLIP scene semantics, are fused by four-layer cross-modal attention and compressed into a 256-dimensional bottleneck latent via a variational information bottleneck. Four task-specific decoders operate on this shared latent simultaneously. A model-agnostic meta-learning adapter augmented with episodic memory and a hypernetwork personalizes the model from as little as 1 s of user interaction data. An extended branch supports six-degrees-of-freedom volumetric content through spherical harmonic viewport decoding and depth-aware tile importance weighting. Trained and evaluated on the IMMERSE-1M combined dataset (1000 h of 360° and volumetric video, 524 users, and over 50,000 mean opinion scores), ImmerseFM-3D reduces the mean angular viewport error by 34%, lowers the bandwidth violation rate from 8.3% to 3.1%, and achieves a QoE Pearson correlation of 0.891. The personalization adapter reaches 90% of peak performance in 22 s, while zero-shot cross-format transfer attains 72% of full in-domain accuracy.
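
The variational information bottleneck step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name vib_bottleneck, the linear projections w_mu and w_logvar, the 64-dimensional fused input, and the standard-normal prior are all assumptions made for illustration. Only the 256-dimensional latent width comes from the abstract.

```python
import numpy as np

LATENT_DIM = 256  # bottleneck width stated in the abstract


def vib_bottleneck(fused, w_mu, w_logvar, rng):
    """Hypothetical bottleneck: sample a latent and compute its KL penalty.

    fused:    (batch, d) fused features (assumed output of the attention stack)
    w_mu:     (d, LATENT_DIM) projection to the posterior mean (assumption)
    w_logvar: (d, LATENT_DIM) projection to the posterior log-variance (assumption)
    """
    mu = fused @ w_mu
    logvar = fused @ w_logvar
    # Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return z, kl


rng = np.random.default_rng(0)
d = 64  # illustrative fused-feature width, not taken from the paper
fused = rng.standard_normal((4, d))
w_mu = rng.standard_normal((d, LATENT_DIM)) * 0.01
w_logvar = rng.standard_normal((d, LATENT_DIM)) * 0.01

z, kl = vib_bottleneck(fused, w_mu, w_logvar, rng)
```

In a setup like this, the shared latent z would be passed to each task-specific decoder, and the KL term would be added to the task losses with a weighting coefficient to control how much information the bottleneck retains.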

ORCID iDs

Gallena Watthage, Reka S. and Fernando, Anil ORCID: https://orcid.org/0000-0002-2158-2367