AR-Net: Adaptive Frame Resolution for Efficient Action Recognition

Meng, Yue; Lin, Chung-Ching; Panda, Rameswar; Sattigeri, Prasanna; Karlinsky, Leonid; Oliva, Aude; Saenko, Kate; Feris, Rogério

doi:10.1007/978-3-030-58571-6_6

Cited by 149 publications

(158 citation statements)

References 45 publications

Supporting

Mentioning: 158

Contrasting


“…The proposed approach is evaluated on FCVID using the mean average precision (mAP) and compared against the top-scoring approaches of the literature, i.e. PivotCorrNN [15], LiteEval [30], AdaFrame [31], SCSampler [17], ST-VLAD [22] and AR-Net [19]. On YLI-MED, the top-1 accuracy is utilized, and the comparison is performed against the top-scoring literature approaches for this dataset, i.e.…”

Section: Results (mentioning)

confidence: 99%

“…In [17], SCSampler uses a lightweight saliency model to select the most salient temporal clips within a long video. In [19], the adaptive resolution network (AR-Net) selects on-the-fly the optimal frame resolution for classifying the video, outperforming the other methods in the FCVID dataset. In contrast to C2D approaches, C3D ones learn the space and time information jointly by exploiting 3D convolutions.…”

Section: Related Work (mentioning)

confidence: 99%

“…The training is performed using Adam optimizer, batch size 16, exponential schedule with initial learning rate 10^-4, decay factor 0.9 at every epoch, and 30 epochs in total.

Method | mAP (%)
ST-VLAD [22] | 77.5
PivotCorrNN [15] | 77.6
LiteEval [30] | 80.0
AdaFrame [31] | 80.2
SCSampler [17] | 81.0
AR-Net (ResNet backbone) [19] | 81.3
AR-Net (EfficientNet backbone) [19] | 84.4
ObjectGraphs (proposed; ResNet backbone) | 84.6
…”

Section: Setup (mentioning)

confidence: 99%
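The exponential schedule quoted in the Setup statement above can be sketched as follows. This is a minimal illustration, not the citing authors' code: the function name `exponential_lr` is hypothetical, and it assumes the decay factor is applied once per completed epoch, so that epoch 0 trains at the initial rate.

```python
def exponential_lr(epoch, initial_lr=1e-4, decay=0.9):
    """Learning rate for a 0-indexed epoch under per-epoch exponential decay.

    Assumption: the rate is multiplied by `decay` after each completed epoch,
    so epoch 0 uses `initial_lr` unchanged.
    """
    return initial_lr * decay ** epoch


# Values taken from the quoted statement: initial rate 1e-4, decay 0.9, 30 epochs.
schedule = [exponential_lr(epoch) for epoch in range(30)]

for epoch in (0, 1, 29):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.3e}")
```

Under this assumption the rate shrinks by 10% per epoch and ends a little under one twentieth of its starting value by the last of the 30 epochs.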

“…improved dense trajectories [25]. ii) C2D: Techniques that utilize deep convolutional neural networks (DCNNs) with 2D convolutional kernels to extract the static event-related information at frame-level, and subsequently utilize an appropriate technique to capture the temporal dynamics of the event [26,22,15,32,30,31,17,19]. iii) C3D: DCNNs that use 3D convolutional kernels to encode simultaneously the spatiotemporal event information in videos [24,28,8].…”

Section: Introduction (mentioning)

confidence: 99%


ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video

Gkalelis, Goulas, Galanopoulos, et al. 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)


In this paper a novel bottom-up video event recognition approach is proposed, ObjectGraphs, which utilizes a rich frame representation and the relations between objects within each frame. Following the application of an object detector (OD) on the frames, graphs are used to model the object relations and a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. Moreover, the weighted in-degrees (WiDs) derived from the graph's adjacency matrix at frame level are used for identifying the objects that were considered most (or least) salient for event recognition and contributed the most (or least) to the final event recognition decision, thus providing an explanation for the latter. The experimental results show that the proposed method achieves state-of-the-art performance on the publicly available FCVID and YLI-MED datasets.
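The weighted in-degree idea mentioned in the abstract above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the toy adjacency matrix, the object labels, and the convention that `adjacency[i][j]` holds the weight of the edge from object `i` to object `j`. The abstract does not specify how the ObjectGraphs authors compute or normalize these values.

```python
def weighted_in_degrees(adjacency):
    """Return the weighted in-degree of every node.

    Assumed convention: adjacency[i][j] is the weight of the edge i -> j,
    so the in-degree of node j is the sum of column j.
    """
    n = len(adjacency)
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def rank_objects(adjacency, labels):
    """Sort object labels from highest to lowest weighted in-degree."""
    wids = weighted_in_degrees(adjacency)
    order = sorted(range(len(labels)), key=lambda j: wids[j], reverse=True)
    return [(labels[j], wids[j]) for j in order]


# Toy frame graph with three detected objects (hypothetical values).
labels = ["person", "ball", "tree"]
adjacency = [
    [0.0, 0.5, 0.25],
    [0.25, 0.0, 0.25],
    [0.5, 0.5, 0.0],
]

ranking = rank_objects(adjacency, labels)
print(ranking)  # [('ball', 1.0), ('person', 0.75), ('tree', 0.5)]
```

In this toy graph the object with the largest column sum ranks first, which is the sense in which a high weighted in-degree would mark an object as more salient for the frame.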


“…Zheng et al. [64] used reinforcement learning agents to select effective segments for inference. Meng et al. [65] proposed to use reinforcement learning to select the optimal resolution for each frame in the video input for effective action recognition in long untrimmed videos.…”

Section: Related Work (mentioning)

confidence: 99%

ASNet: Auto-Augmented Siamese Neural Network for Action Recognition

Zhang, Xiong, et al. 2021

Sensors


Human action recognition methods in videos based on deep convolutional neural networks usually use random cropping or its variants for data augmentation. However, this traditional data augmentation approach may generate many non-informative samples (video patches covering only a small part of the foreground or only the background) that are not related to a specific action. These samples can be regarded as noisy samples with incorrect labels, which reduces the overall action recognition performance. In this paper, we attempt to mitigate the impact of noisy samples by proposing an Auto-augmented Siamese Neural Network (ASNet). In this framework, we propose backpropagating salient patches and randomly cropped samples in the same iteration to perform gradient compensation to alleviate the adverse gradient effects of non-informative samples. Salient patches refer to the samples containing critical information for human action recognition. The generation of salient patches is formulated as a Markov decision process, and a reinforcement learning agent called SPA (Salient Patch Agent) is introduced to extract patches in a weakly supervised manner without extra labels. Extensive experiments were conducted on two well-known datasets UCF-101 and HMDB-51 to verify the effectiveness of the proposed SPA and ASNet.


Fusing surveillance videos and three‐dimensional scene: A mixed reality system

Cui, Khan, et al. 2022

Computer Animation & Virtual


Augmented Virtual Environments (AVE) or Virtual-Reality Fusion systems fuse dynamic videos with static three-dimensional (3D) models of a virtual environment to provide an optimal solution for visualizing and understanding multichannel surveillance systems. However, texture distortion caused by viewpoint changes in such systems is a critical issue that needs to be addressed. To minimize texture fusion distortion, this paper presents a novel virtual environment system in two phases, offline and online phases, to dynamically fuse multiple surveillance videos with a virtual 3D scene. In the offline phase, a static virtual environment is obtained by performing a 3D photogrammetric reconstruction from the input images of the scene. In the online phase, the virtual environment is augmented by fusing multiple videos through two optional strategies. One strategy is to dynamically map images of different videos onto a 3D model of the virtual environment, and the other is to extract moving objects and represent them as billboards. The system can be used to visualize a 3D environment from any viewpoint augmented by real-time videos. Experiments and user studies in different scenarios demonstrate the superiority of our system.
