DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes

Hao Yan, Zhihui Ke, Xiaobo Zhou, Tie Qiu, Xidong Shi, Dadong Jiang
College of Intelligence and Computing, Tianjin University

(Teaser figure: the decoded parts of the static and dynamic codes are fused to produce the frame prediction.)

Abstract

Implicit neural representations for video have recently evolved rapidly into a new tool for high-quality video representation and compression. However, existing work mainly relies on neural networks to model videos uniformly, neglecting the distinct modeling of the static and dynamic components across the entire video. This unified modeling approach also underutilizes the temporal correlations present in videos. To exploit these temporal correlations for efficient modeling, we propose DS-NeRV, which decomposes a video into sparse, learnable static and dynamic codes without the need for explicit optical flow or residual supervision. These codes represent the shared static information and the cross-time dynamic information within the video. Additionally, we design a cross-channel attention (CCA) based fusion module that fuses the static and dynamic codes for frame decoding. Thanks to its compact static and dynamic code design, our approach achieves high-quality reconstruction of 31.20 PSNR with only 0.35M parameters. We outperform existing implicit neural video representations on many downstream tasks and achieve performance comparable to conventional video compression methods.

(1) DS-NeRV overview

DS-NeRV decomposes the video into learnable static and dynamic codes. Static codes: the two gold-yellow static codes shown above are the two nearest to the queried frame index; their weighted sum is passed to the fusion decoder. Dynamic codes: the original dynamic codes are shown in gray and the interpolated dynamic code in yellow; the selected code, shown in blue, is forwarded to the fusion decoder.
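The code-selection step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the assumption that codes are evenly spaced in normalized time, and the use of plain linear weights for both streams are ours.

```python
import torch


def get_codes(t, T, static_codes, dynamic_codes):
    """Retrieve the codes for frame index t (hypothetical sketch).

    static_codes:  (Ns, C, H, W) sparse learnable static codes
    dynamic_codes: (Nd, c, h, w) learnable dynamic codes
    Both sets are assumed evenly spaced over normalized time [0, 1].
    """
    pos = t / (T - 1)  # normalized time of frame t

    # Static stream: weighted sum of the two temporally nearest codes.
    Ns = static_codes.shape[0]
    s = pos * (Ns - 1)                    # continuous index into static codes
    i0 = int(s)                           # left neighbor
    i1 = min(i0 + 1, Ns - 1)              # right neighbor
    w = s - i0                            # distance-based weight
    static = (1 - w) * static_codes[i0] + w * static_codes[i1]

    # Dynamic stream: linear interpolation between the two nearest codes.
    Nd = dynamic_codes.shape[0]
    d = pos * (Nd - 1)
    j0 = int(d)
    j1 = min(j0 + 1, Nd - 1)
    u = d - j0
    dynamic = (1 - u) * dynamic_codes[j0] + u * dynamic_codes[j1]

    return static, dynamic
```

Because the continuous index is derived from a normalized time, the same lookup also works for non-integer `t`, which is what makes frame interpolation at unseen indices possible.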

(2) Fusion Decoder and CCA Fusion

(a) Pipeline of the Fusion Decoder. The decoder takes as input the static and dynamic codes corresponding to frame index t and fuses their information to output the frame. (b) Architecture of the CCA Fusion module. The CCA module fuses the information of the static and dynamic codes via cross-channel attention.
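One way to realize cross-channel attention between the two code streams is sketched below. This is an assumption-laden illustration, not the paper's module: the class name, the choice of treating each channel's flattened spatial map as a token, and the query/key/value assignment (dynamic channels query static channels) are ours.

```python
import torch
import torch.nn as nn


class CCAFusion(nn.Module):
    """Hypothetical cross-channel attention fusion sketch.

    Each channel's flattened spatial map is treated as one token, and
    attention is computed across channels rather than spatial positions,
    letting dynamic-code channels attend to static-code channels.
    """

    def __init__(self, spatial):
        super().__init__()
        self.q = nn.Linear(spatial, spatial)   # queries from dynamic channels
        self.k = nn.Linear(spatial, spatial)   # keys from static channels
        self.v = nn.Linear(spatial, spatial)   # values from static channels
        self.proj = nn.Linear(spatial, spatial)
        self.scale = spatial ** -0.5

    def forward(self, static, dynamic):
        # static: (B, Cs, H, W), dynamic: (B, Cd, H, W), same resolution
        B, _, H, W = static.shape
        s = static.flatten(2)                  # (B, Cs, H*W) channel tokens
        d = dynamic.flatten(2)                 # (B, Cd, H*W)
        q, k, v = self.q(d), self.k(s), self.v(s)
        # (B, Cd, Cs) attention map over channel pairs
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        fused = self.proj(attn @ v)            # (B, Cd, H*W)
        return fused.view(B, -1, H, W)
```

Attending over channels instead of spatial positions keeps the attention map small (Cd x Cs), which fits the parameter-efficiency goal of the method.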

(3) Video Reconstruction

Quantitative results

Qualitative results

Comparison of video reconstruction results on UVG and DAVIS. (Top) Jockey. (Bottom) Blackswan.

(4) Video Inpainting

Quantitative results

Qualitative results with disperse mask

Ground Truth / Ours / HNeRV / DNeRV

Qualitative results with central mask

Ground Truth / Ours / HNeRV / DNeRV

(5) Video Interpolation

Quantitative results

Qualitative results

Ground Truth / Ours / HNeRV / DNeRV

(6) Video Compression

Quantitative results