Abstract: Learning-based video compression has attracted increasing attention in the past few years. Previous hybrid coding approaches rely on pixel-space operations to reduce spatial and temporal redundancy, which may suffer from inaccurate motion estimation or less effective motion compensation. In this work, we propose a feature-space video coding network (FVC) by performing all major operations (i.e., motion estimation, motion compression, motion compensation and residual compression) in the feature space. Specifical…
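To make the feature-space idea concrete, the sketch below shows what one P-frame step carried out entirely on feature maps could look like. It is an illustration only: the module names, layer sizes, the deformable-convolution compensation, and the use of round() as a stand-in for learned entropy coding are our assumptions and do not reproduce the FVC architecture.

```python
# Minimal sketch (assumptions, not the released FVC code) of hybrid coding in
# feature space: the reference feature map is compensated with predicted
# deformable offsets, and only the feature-space residual is "compressed".
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class FeatureSpaceCoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.to_feat = nn.Conv2d(3, ch, 3, stride=2, padding=1)        # pixels -> feature space
        self.offset_pred = nn.Conv2d(2 * ch, 2 * 3 * 3, 3, padding=1)  # "motion" as deformable offsets
        self.dcn_weight = nn.Parameter(torch.randn(ch, ch, 3, 3) * 0.01)
        self.to_pix = nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1)

    def forward(self, x_cur, f_ref):
        f_cur = self.to_feat(x_cur)                                    # current-frame features
        offsets = self.offset_pred(torch.cat([f_cur, f_ref], dim=1))   # motion estimation
        offsets_hat = torch.round(offsets)                             # stand-in for motion compression
        f_pred = deform_conv2d(f_ref, offsets_hat, self.dcn_weight,
                               padding=1)                              # motion compensation
        res_hat = torch.round(f_cur - f_pred)                          # stand-in for residual compression
        f_rec = f_pred + res_hat                                       # decoded features, next reference
        return self.to_pix(f_rec), f_rec

coder = FeatureSpaceCoder()
x0, x1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
f_ref = coder.to_feat(x0)                 # I-frame features (assumed available to the decoder)
x1_hat, f_ref = coder(x1, f_ref)
print(x1_hat.shape)                       # torch.Size([1, 3, 64, 64])
```

The point of the sketch is only that estimation, compensation, and residual coding all operate on feature maps, so a single pixel-space reconstruction happens at the very end rather than after every step.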
“…For the traditional video codecs [52] [53], different linear transformations are exploited to better capture the statistical characteristics of the texture and motion information within the videos. Later, learnable video codecs [54] [55] [56] [57] [58] [59] [60] gained increasing attention. Following the traditional hybrid video compression framework, Lu et al. [54] proposed the first end-to-end optimized video compression framework, in which all the key components in H.264/H.265 are replaced with deep neural networks.…”
Most video understanding methods are trained on high-quality videos. However, in real-world scenarios, videos are first compressed before transmission and then decompressed for understanding. The decompressed videos may have lost information that is critical to the downstream tasks. To address this issue, we propose the first coding framework for compressed video understanding, in which an additional learnable analytic bitstream is transmitted alongside the original video bitstream. With a dedicated self-supervised optimization target and dynamic network architectures, this new stream substantially boosts the downstream tasks at a small bit cost. With only one-time training, our framework can be deployed for multiple downstream tasks. Our framework also enjoys the best of both worlds: (1) the high efficiency of industrial video codecs and (2) the flexible coding capability of neural networks (NNs). Finally, we build a rigorous benchmark for compressed video understanding on three popular tasks over seven large-scale datasets and four compression levels. The proposed Understanding-oriented Video Coding framework (UVC) consistently demonstrates significantly stronger performance than the baseline industrial codec.
“…[24,49], which use 3D convolution architectures, and Refs. [3,9,11,12,16,23,26,33,36,37,48,50,52,53,69], which model P-frames as an optical flow field applied to the previous frame plus a residual model.…”
Section: Related Work
“…first estimate of the current frame. In addition to the optical flow, pixel-level [3,37] or feature-level [26] residuals are compressed. The final prediction for the frame is given by applying the optical flow field to the previous reconstructed frame in an operation known as warping or motion compensation and adding the residuals.…”
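The warping-plus-residual prediction described in this snippet can be written very compactly. The sketch below is a generic illustration of that step (the function and variable names are ours, not from any cited codec), assuming the decoded flow field is expressed in pixel displacements.

```python
# Generic sketch of P-frame prediction: warp the previously reconstructed frame
# with the decoded optical flow, then add the decoded residual.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Bilinearly warp `frame` (N,3,H,W) by a dense flow field (N,2,H,W) in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    x_new = xs.float() + flow[:, 0]                       # horizontal displacement
    y_new = ys.float() + flow[:, 1]                       # vertical displacement
    grid = torch.stack((2 * x_new / max(w - 1, 1) - 1,
                        2 * y_new / max(h - 1, 1) - 1), dim=-1)  # grid_sample expects [-1, 1]
    return F.grid_sample(frame, grid, align_corners=True)

prev_rec = torch.rand(1, 3, 64, 64)                # previously reconstructed frame
flow_hat = torch.zeros(1, 2, 64, 64)               # decoded flow (zero motion here)
res_hat = torch.zeros(1, 3, 64, 64)                # decoded residual
cur_rec = warp(prev_rec, flow_hat) + res_hat       # motion compensation + residual addition
```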
We propose a method to compress full-resolution video sequences with implicit neural representations. Each frame is represented as a neural network that maps coordinate positions to pixel values. We use a separate implicit network to modulate the coordinate inputs, which enables efficient motion compensation between frames. Together with a small residual network, this allows us to efficiently compress P-frames relative to the previous frame. We further lower the bitrate by storing the network weights with learned integer quantization. Our method, which we call implicit pixel flow (IPF), offers several simplifications over established neural video codecs: it does not require the receiver to have access to a pretrained neural network, does not use expensive interpolation-based warping operations, and does not require a separate training dataset. We demonstrate the feasibility of neural implicit compression on image and video data.
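As an illustration of the coordinate-to-pixel idea (not the IPF architecture itself: the MLP size, activations, and training loop below are assumptions, and the modulation network, residual network, and weight quantization are omitted), a single frame can be "encoded" by overfitting a small network to it:

```python
# Minimal sketch of an implicit frame representation: an MLP maps normalized
# (x, y) coordinates to RGB, and the fitted weights stand in for the bitstream.
import torch
import torch.nn as nn

class ImplicitFrame(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),          # RGB in [0, 1]
        )

    def forward(self, coords):                           # coords: (N, 2) in [-1, 1]
        return self.net(coords)

def fit_frame(frame, steps=200):
    """Overfit one (H, W, 3) frame; the trained weights act as its code."""
    h, w, _ = frame.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack((xs, ys), dim=-1).reshape(-1, 2)
    target = frame.reshape(-1, 3)
    model = ImplicitFrame()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model

model = fit_frame(torch.rand(32, 32, 3))                 # toy frame; decoding = evaluating the MLP
```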
“…Most previous works focus on the low-delay P configuration [6,7] (used for videoconferencing) and omit the Random Access configuration (used for streaming at large). Furthermore, most neural codecs [10,11] are assessed using an I-frame period (e.g. 10 or 12 frames, regardless of the video framerate) shorter than that expected by the Common Test Conditions of modern video coders such as those defined for HEVC or VVC [12].…”
This paper introduces AIVC, an end-to-end neural video codec. It is based on two conditional autoencoders, MNet and CNet, for motion compensation and coding, respectively. AIVC learns to compress videos under any coding configuration through a single end-to-end rate-distortion optimization. Furthermore, it offers performance competitive with the recent video coder HEVC under several established test conditions. A comprehensive ablation study is performed to evaluate the benefits of the different modules composing AIVC. The implementation is made available at https://orange-opensource.github.io/AIVC/.
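The single rate-distortion objective mentioned here typically takes the familiar Lagrangian form. The sketch below is a generic version of that objective (the function name, the MSE distortion, and the bits-per-pixel rate estimate are our assumptions, not AIVC's exact loss).

```python
# Generic sketch of the end-to-end rate-distortion objective L = R + lambda * D,
# with R estimated in bits per pixel from the latent likelihoods and D as MSE.
import torch

def rate_distortion_loss(x, x_hat, latent_likelihoods, lmbda=0.01):
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]               # N * H * W
    rate = sum(-torch.log2(p).sum() for p in latent_likelihoods) / num_pixels
    distortion = torch.mean((x - x_hat) ** 2)
    return rate + lmbda * distortion

x = torch.rand(1, 3, 64, 64)
x_hat = x + 0.01 * torch.randn_like(x)                              # pretend reconstruction
p_motion = torch.full((1, 8, 16, 16), 0.9)                          # pretend latent likelihoods
p_res = torch.full((1, 16, 16, 16), 0.8)
print(rate_distortion_loss(x, x_hat, [p_motion, p_res]))
```

Sweeping the trade-off weight then traces out different points on the rate-distortion curve, which is how a single training recipe can target multiple bitrates.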