Listen to Look: Action Recognition by Previewing Audio

Gao, Ruohan; Oh, Tae-Hyun; Grauman, Kristen; Torresani, Lorenzo

doi:10.1109/cvpr42600.2020.01047

Cited by 241 publications

(179 citation statements)

References 58 publications

Supporting

Mentioning

178

Contrasting


“…It is also possible to reduce the complexity of the action recognition process by exploiting the additional side information. For example, the sound information or pre-calculated features, which are normally existed in the compressed video data, can be utilized to provide the additional information for enabling the clip-level processing [37] or selecting the dominant clips strongly related to the actions [38], [39], increasing the recognition accuracy while even reducing the computational complexity.…”

Section: B Related Work (mentioning)

confidence: 99%

“…average number of activated MAC operations for performing a video-level inference with the baseline networks in Table 1, which are widely used for evaluating the amount of required CNN resources [38]. For fair comparisons, in addition to UCF-101 used at the prior experiments, we also tested different 3D-CNN architectures on HMDB-51 [47] and Kinetics-400 [48] datasets.…”

Section: B Hardware Costs (mentioning)

confidence: 99%


Low-Cost Network Scheduling of 3D-CNN Processing for Embedded Action Recognition

Lee, Kim, et al. (2021), IEEE Access


The recent 3D convolutional neural network (3D-CNN) is a promising candidate for solving the action recognition problem, offering attractive algorithm-level performance. Due to its excessive computational cost, however, it is almost impractical to apply an advanced 3D-CNN architecture to a resource-limited real-time embedded system. In this work, we present several optimization schemes that can relax the complexity of 3D-CNN processing without sacrificing recognition accuracy. More precisely, we first develop several 3D-CNN architectures that exploit the trade-off between network complexity and recognition performance. The proposed method then evaluates the current confidence level and dynamically changes the network structure used for the next clip-level inference. In addition, we introduce a systematic way of managing the network sequence to minimize computing overhead while maintaining acceptable algorithm-level performance. As a result, compared to previous works, the proposed approaches drastically reduce processing cost and energy consumption by selecting the simplest 3D-CNN architecture at run time, allowing cost-effective action recognition on embedded edge devices.


“…Existing methods [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] have exploited various data modalities for HAR. In this section, we review HAR methods based on RGB, skeleton, depth, infrared sequence, point cloud, event stream, audio, acceleration, radar, WiFi, and other modalities.…”

Section: Single Modality (mentioning)

confidence: 99%

“…In the early days, most of the works focused on using RGB (or gray-scale) videos as inputs for HAR [5], due to their popularity in daily life. Recent years have witnessed an emergence of works using other data modalities [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], including skeleton, depth, infrared sequence, point cloud, event stream, audio, acceleration, radar, and WiFi, etc., for HAR. This is mainly thanks to the development of different kinds of accurate and affordable sensors (such as Kinect), and the distinct advantages of different data modalities for HAR in various application scenarios.…”

Section: Introduction (mentioning)

confidence: 99%

Human Action Recognition from Various Data Modalities: A Review

Rahmani, Bennamoun, Ke (2021), Preprint


“…Follow up works [2,33] further investigated to jointly learn the visual and audio representation using a visual-audio correspondence task. Instead of learning feature representations, recent works have also explored to localize sound source in images or videos [29,26,3,48,64], biometric matching [39], visual-guided sound source separation [64,15,19,60], auditory vehicle tracking [18], multi-modal action recognition [36,35,21], audio inpainting [66], emotion recognition [1], audio-visual event localization [56], multi-modal physical scene understanding [16], audio-visual co-segmentation [47], aerial scene recognition [27] and audio-visual embodied navigation [17].…”

Section: Audio-visual Learning (mentioning)

confidence: 99%

Foley Music: Learning to Generate Music from Videos

Gan, Huang, Chen, et al. (2020), Lecture Notes in Computer Science


In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI event can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage the readers to watch the supplementary video with audio turned on to experience the results.
