2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01047
Listen to Look: Action Recognition by Previewing Audio

Cited by 241 publications (179 citation statements)
References 58 publications
“…It is also possible to reduce the complexity of the action recognition process by exploiting additional side information. For example, the sound information or pre-calculated features, which normally exist in the compressed video data, can be utilized to provide additional information for enabling clip-level processing [37] or for selecting the dominant clips strongly related to the actions [38], [39], increasing the recognition accuracy while also reducing the computational complexity.…”
Section: B. Related Work
confidence: 99%
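The clip-selection idea cited above (using cheap audio cues to pick the video clips most relevant to the action before running an expensive visual network) can be sketched as follows. The scoring function here is a placeholder, not the cited papers' learned model, which is trained end to end; the embedding dimensions and top-k policy are illustrative assumptions.

```python
import numpy as np

def select_dominant_clips(audio_features, k=3):
    """Score each clip with a cheap audio-based proxy and keep the top-k.

    audio_features: (num_clips, feat_dim) array of per-clip audio embeddings.
    The L2-norm scoring is a stand-in for a learned relevance head.
    """
    scores = np.linalg.norm(audio_features, axis=1)  # cheap per-clip score
    top = np.argsort(scores)[::-1][:k]               # indices of best clips
    return np.sort(top)                              # chronological order

# Toy example: 10 clips with 8-dim audio embeddings; only the 3 selected
# clips would then be passed to the heavy visual recognition network.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 8))
print(select_dominant_clips(feats, k=3))
```

The point of the design is that audio is orders of magnitude cheaper to process than video frames, so the selection step adds little overhead while skipping most of the visual computation.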
“…average number of activated MAC operations for performing a video-level inference with the baseline networks in Table 1, which are widely used for evaluating the amount of required CNN resources [38]. For fair comparisons, in addition to UCF-101 used in the prior experiments, we also tested different 3D-CNN architectures on the HMDB-51 [47] and Kinetics-400 [48] datasets.…”
Section: B. Hardware Costs
confidence: 99%
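The MAC (multiply-accumulate) count used above as a hardware-cost metric can be computed analytically per layer. A minimal sketch for a single 2D convolution layer, with illustrative layer dimensions:

```python
def conv2d_macs(in_ch, out_ch, kernel, out_h, out_w):
    """Multiply-accumulate count for one 2D convolution layer.

    Each output element requires in_ch * kernel * kernel multiplies,
    and there are out_ch * out_h * out_w output elements; a network's
    total cost is the sum of this quantity over its layers.
    """
    return in_ch * kernel * kernel * out_ch * out_h * out_w

# e.g. a 3x3 conv mapping 64 -> 128 channels on a 56x56 feature map:
macs = conv2d_macs(64, 128, 3, 56, 56)
print(f"{macs / 1e9:.2f} GMACs")  # prints "0.23 GMACs"
```

For 3D CNNs the same formula gains a temporal kernel factor and a temporal output dimension, which is why clip-level skipping yields large savings.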
“…Existing methods [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] have exploited various data modalities for HAR. In this section, we review HAR methods based on RGB, skeleton, depth, infrared sequence, point cloud, event stream, audio, acceleration, radar, WiFi, and other modalities.…”
Section: Single Modality
confidence: 99%
“…In the early days, most works focused on using RGB (or gray-scale) videos as inputs for HAR [5], due to their popularity in daily life. Recent years have witnessed the emergence of works using other data modalities [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], including skeleton, depth, infrared sequence, point cloud, event stream, audio, acceleration, radar, and WiFi, for HAR. This is mainly thanks to the development of accurate and affordable sensors of different kinds (such as Kinect), and to the distinct advantages of different data modalities for HAR in various application scenarios.…”
Section: Introduction
confidence: 99%
“…Follow-up works [2,33] further investigated jointly learning visual and audio representations using a visual-audio correspondence task. Instead of learning feature representations, recent works have also explored localizing sound sources in images or videos [29,26,3,48,64], biometric matching [39], visually guided sound source separation [64,15,19,60], auditory vehicle tracking [18], multi-modal action recognition [36,35,21], audio inpainting [66], emotion recognition [1], audio-visual event localization [56], multi-modal physical scene understanding [16], audio-visual co-segmentation [47], aerial scene recognition [27], and audio-visual embodied navigation [17].…”
Section: Audio-Visual Learning
confidence: 99%