Abstract
Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstruction. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy yields up to a 1.9 dB PSNR improvement for video and a 2.6 dB PSNR improvement for images compared to methods that perform sequential layer-wise training.
Motivation
To see why a progressive 2DGS representation is not straightforward, consider a naive scheme that ranks splats by their contribution to the rendered output and forms layers accordingly: splats with higher contribution form the coarse, lower-fidelity layers, while splats with smaller contribution refine the details and form the higher layers. This approach, however, produces visible artifacts at lower layers, such as noticeable gaps in the reconstructed frames, as shown in the above figure (Pruned Gaussian splats). The root cause is that the splats are jointly trained to overfit the highest-fidelity input by design, making them highly interdependent. Consequently, removing any subset, even splats with small contributions, can drastically degrade the overall quality. In P-GSVC, we instead build a layered progressive 2DGS in which the base layer preserves the complete scene structure and higher layers add details, supported by joint training across layers to ensure reliable and stable progressive decoding.
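A toy 1D sketch can make this failure mode concrete. The linear splat model, signal, and pruning rule below are illustrative assumptions, not the actual 2DGS pipeline: we jointly fit Gaussian "splats" to a target signal, then prune the lowest-contribution half to form a naive base layer.

```python
import numpy as np

# Toy 1D stand-in for the naive scheme (illustrative only): jointly fit
# Gaussian "splats" to a target signal, then prune by contribution.
N, K = 256, 32
x = np.linspace(0.0, 1.0, N)
target = np.sin(6 * np.pi * x) + 0.5 * np.sin(17 * np.pi * x)

mu = np.linspace(0.0, 1.0, K)          # splat centers
B = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / 0.03) ** 2)

# All splat amplitudes are fit jointly to overfit the full-fidelity target.
w, *_ = np.linalg.lstsq(B, target, rcond=None)
full_err = float(np.mean((B @ w - target) ** 2))

# Naive base layer: keep only the top half of splats by contribution.
contrib = np.abs(w) * B.sum(axis=0)
keep = np.argsort(-contrib)[: K // 2]
base_err = float(np.mean((B[:, keep] @ w[keep] - target) ** 2))

print(f"full MSE: {full_err:.4f}, pruned-half MSE: {base_err:.4f}")
```

Because the kept amplitudes were optimized assuming the pruned splats were present, the pruned rendering degrades far more than the removed splats' individual contributions would suggest.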
Methodology
We propose Progressive Gaussian Splat Video Coding (P-GSVC), a progressive scalable video codec that uses 2D Gaussian splats as primitives. P-GSVC applies to scalable image coding as well, treating the image as a single-frame video. The above figure shows an overview of P-GSVC, where an input video is represented by multiple layers of Gaussian splats. As in a typical layered progressive codec, decoding only the base layer reconstructs a low-fidelity version of the input, and decoding additional enhancement layers increases the fidelity. To avoid cross-layer optimization conflicts and ensure smooth progressivity, P-GSVC uses a joint training strategy. In each iteration, it simultaneously optimizes two fidelity levels: one using all splats up to the highest layer, and another using only the splats up to some layer i. During training, we cycle the target layer i from the base layer to the second-highest layer. This joint optimization keeps the optimization trajectories of splats aligned across layers while each layer's splats are still trained to meet that layer's objective. Furthermore, by cycling through i, the gradient field remains relatively stable during transitions between optimization objectives, ensuring that both final and intermediate reconstructions converge to acceptable quality.
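The joint training loop can be sketched as follows. This is a minimal 1D stand-in under stated assumptions (layer resolutions, learning rate, and a linear splat model are our illustrative choices, not the paper's implementation): each iteration takes a gradient step on the full-fidelity loss plus the loss of the cycled level i, so lower layers serve both objectives.

```python
import numpy as np

# Minimal sketch of the joint training strategy (toy 1D stand-in;
# hyperparameters and the linear splat model are assumptions).
N, L = 256, 3
x = np.linspace(0.0, 1.0, N)
target = np.sin(4 * np.pi * x) + 0.3 * np.sin(21 * np.pi * x)

# Coarse-to-fine layers: more, narrower splats at higher layers.
K = [8, 16, 32]
sigma = [0.08, 0.04, 0.02]
B = [np.exp(-0.5 * ((x[:, None] - np.linspace(0, 1, K[l])[None, :])
                    / sigma[l]) ** 2) for l in range(L)]
w = [np.zeros(K[l]) for l in range(L)]   # trainable amplitudes per layer

def render(up_to):
    """Reconstruction using splats of layers 0..up_to (inclusive)."""
    return sum(B[l] @ w[l] for l in range(up_to + 1))

lr = 0.003
for step in range(3000):
    i = step % (L - 1)                   # cycle target layer: base .. L-2
    r_full = render(L - 1) - target      # residual at the highest fidelity
    r_part = render(i) - target          # residual at the cycled level i
    for l in range(L):
        g = B[l].T @ r_full              # every layer serves the full level
        if l <= i:
            g = g + B[l].T @ r_part      # layers <= i also serve level i
        w[l] -= lr * g

mse = [float(np.mean((render(l) - target) ** 2)) for l in range(L)]
print("per-level MSE (base -> full):", mse)
```

Because every update mixes the full-fidelity gradient with the cycled level's gradient, the base layer converges to a complete coarse reconstruction rather than an arbitrary subset of an overfit model.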
Results
P-GSVC on Image
P-GSVC consistently outperforms LIG, demonstrating the effectiveness of our joint training strategy over the sequential layer-wise approach in image representation tasks.
P-GSVC on Video
P-GSVC supports progressive video coding. The video above shows a progressive decoding demo on the Jockey sequence.
Rate–Distortion Trade-off
Our P-GSVC employs a layered structure in which Gaussian parameters are shared across levels to achieve scalable video coding. As a result, the rate–distortion curves appear piecewise, yet the gaps between adjacent levels remain very small (around 0.2 dB), demonstrating the effectiveness of joint training. Inevitably, achieving scalability introduces a quality overhead. Under our default strategy of evenly distributing splats across layers, this overhead remains below about 1.1 dB in PSNR across all layers under the same Gaussian budget. The quality gap shrinks at higher quality levels as the total number of splats increases. This trend suggests that allocating more splats to the enhancement layers, which focus on reconstructing high-frequency details, could further reduce the overhead while preserving scalability.
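To make the suggested allocation concrete, here is a hypothetical splat-budget split. Both helpers and the geometric weighting are our illustrative assumptions; only the even split corresponds to the paper's default strategy.

```python
# Hypothetical splat-budget allocators (illustrative sketch; only the
# even split reflects the default strategy described above).
def even_split(budget, num_layers):
    base = budget // num_layers
    counts = [base] * num_layers
    counts[-1] += budget - base * num_layers   # remainder to the top layer
    return counts

def detail_weighted_split(budget, num_layers, growth=2.0):
    # Weight enhancement layers more heavily (geometric growth),
    # allocating more splats to high-frequency detail layers.
    weights = [growth ** l for l in range(num_layers)]
    total = sum(weights)
    counts = [int(budget * w / total) for w in weights]
    counts[-1] += budget - sum(counts)         # absorb rounding remainder
    return counts

print(even_split(9000, 3))             # [3000, 3000, 3000]
print(detail_weighted_split(9000, 3))  # weights 1:2:4 across the layers
```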
Conclusion
We introduced P-GSVC, the first scalable Gaussian representation framework for both images and videos, built upon progressive 2D Gaussian Splatting. We highlighted and addressed a key challenge in optimizing progressive Gaussian representations, showing that incremental layer-wise training leads to poorer quality due to cross-layer optimization conflicts. Through joint, incremental, and cyclic training across levels, P-GSVC demonstrates that stable progressive reconstruction of both images and videos is achievable, with significant improvements over layer-wise sequential training. This work positions Gaussian splats as an alternative primitive for encoding images and videos for adaptive media delivery, bridging classical scalable codecs and neural-based codecs.