Video Synthesis from Intensity and Event Frames

Pini, Stefano; Borghi, Guido; Vezzani, Roberto; Cucchiara, Rita

doi:10.1007/978-3-030-30642-7_28

Cited by 14 publications

(12 citation statements)

References 23 publications


“…This is why for more complex tasks such as image reconstruction or monocular depth, state-of-the-art methods use a data-driven approach [7], [6], [19], [20], [21]. Of these, many rely on recurrent architectures which can leverage long time windows of events for improved prediction [21], [7]. Although there exist many purely event-based learning methods, few address the fusion of images and events [9], [10], [22]. These approaches fuse both modalities by synchronizing and concatenating both inputs and passing them to a standard feed-forward network [9], [10], [22].…”

Section: Related Work

mentioning

confidence: 99%

“…They have identical residuals and decoders but instead of recurrent state combination operators, they feature recurrent convLSTM encoders at each level. While E only receives voxel grids as input, I receives only gray-scale frames and E+I receives stacks of voxel grids and frames, similar to [9]. For E+I when a new voxel grid arrives, we stack it with a copy of the last seen image.…”

Section: B Baselines

mentioning

confidence: 99%

“…Although there exist many purely event-based learning methods, few address the fusion of images and events [9], [10], [22]. These approaches fuse both modalities by synchronizing and concatenating both inputs and passing them to a standard feed-forward network [9], [10], [22]. While this strategy improves over passing each input individually, it discards the asynchronous nature and high temporal resolution of the events through stacking and synchronization.…”

Section: Related Work

mentioning

confidence: 99%

“…1-C in the appendix. While E only receives voxel grids as input, I receives only gray-scale frames and E+I receives stacks of voxel grids and frames, similar to [9]. For E+I when a new voxel grid arrives, we stack it with a copy of the last seen image.…”

Section: B Baselines

mentioning

confidence: 99%
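The E+I baseline quoted above pairs each incoming voxel grid with a copy of the last seen image. A minimal sketch of that stacking rule follows, in plain Python with nested lists standing in for tensors. The function name `stack_events_and_frames`, the `("frame", ...)` / `("events", ...)` stream format, and the choice to skip voxel grids that arrive before any frame are all assumptions made for illustration; none of them come from the cited paper.

```python
from copy import deepcopy


def stack_events_and_frames(stream):
    """Yield one stacked input per voxel grid in a time-ordered stream.

    `stream` is a hypothetical iterable of ("frame", image) and
    ("events", voxel_grid) tuples. An image is an H x W nested list;
    a voxel grid is a list of B such H x W channels.
    """
    last_image = None
    for kind, data in stream:
        if kind == "frame":
            # Remember the most recent intensity frame.
            last_image = data
        elif kind == "events":
            if last_image is None:
                # Assumption: drop voxel grids seen before the first frame.
                continue
            # Channel-wise stack: B event channels plus one copied image channel.
            yield list(data) + [deepcopy(last_image)]
        else:
            raise ValueError(f"unknown stream item kind: {kind!r}")
```

Each yielded item has B + 1 channels. Because the image channel is only a copy of the latest frame, several consecutive voxel grids can share the same image content, which is the synchronization that the quoted statements describe.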

“…By contrast, learning-based methods have leveraged large datasets to generate more accurate predictions. However, current learning-based methods for events are limited in that they group events and frames into synchronized stacks which are passed to feed-forward neural networks [9], [10]. Not only does this strategy sacrifice the asynchronicity and high temporal resolution of events, but it also limits the temporal context by using simple feed-forward networks instead of RNNs.…”

mentioning

confidence: 99%


Combining Events and Frames Using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

Gehrig, Michelle, Gehrig, et al. 2021

IEEE Robot. Autom. Lett.

114


Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous "events". They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. Both sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods. This is because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
