AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Wang, Yulin; Yue, Yang; Lin, Yuanze; Jiang, Haojun; Lai, Zihang; Куликов, В. А.; Orlov, Nikita; Huang, Gao

doi:10.1109/cvpr52688.2022.01943

Cited by 40 publications

(16 citation statements)

References 56 publications

“…ii) Concerning our ViGAT variant that utilizes a ResNet backbone pretrained on ImageNet, this outperforms the best-performing literature approaches that similarly use a ResNet backbone in FCVID and ActivityNet (see Tables 1 and 3). Specifically, we observe a significant performance gain of 1% over AdaFocusV2 [19], which is the previous state-of-the-art method. We also see that ViGAT provides a performance improvement of 1.4% over ObjectGraphs [5], which is the best previous bottom-up method.…”

Section: Event Recognition Results (mentioning)

confidence: 72%

“…The proposed approach is compared against the top-scoring approaches of the literature on the three employed datasets, specifically, TBN [44], BAT [16], MARS [62], Fast-S3D [38], RMS [64], CGNL [30], ATFR [72], Ada3D [17], TCPNet [45], LgNet [68], ST-VLAD [50], PivotCorrNN [53], LiteEval [57], AdaFrame [54], Listen to Look [56], SCSampler [73], AR-Net [7], SMART [59], ObjectGraphs [5], MARL [55], FrameExit [6] and AdaFocusV2 [19] (note that not all of these works report results for all the datasets used in the present work). The reported results on FCVID, MiniKinetics and ActivityNet are shown in Tables 1, 2 and 3, respectively.…”

Table fragment captured inside the excerpt:

Method                                      mAP (%)
AdaFrame [54]                               71.5
Listen to Look [56]                         72.3
LiteEval [57]                               72.7
SCSampler [73]                              72.9
AR-Net [7]                                  73.8
FrameExit [6]                               77.3
AdaFocusV2 [19]                             79.0
AR-Net (EfficientNet backbone) [7]          79.7
MARL (ResNet backbone on Kinetics) [55]     82.9
FrameExit (X3D-S backbone) [6]              87

Section: Event Recognition Results (mentioning)

confidence: 99%

“…Contrarily to the above, AdaFocus [58] utilizes a reinforcement learning policy network to leverage spatial redundancy, i.e., selects the most salient regions in the video frames with respect to the action recognition task. In [19], AdaFocusV2 extends [58] by replacing reinforcement learning with a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. The above methods operate on untrimmed videos (i.e., videos that contain many irrelevant frames to the underlying action), where it is much easier to identify and discard less-significant image regions or entire frames.…”

Section: Top-down Approaches (mentioning)

confidence: 99%
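The excerpt above says AdaFocusV2 swaps reinforcement learning for a differentiable, interpolation-based patch selection. The sketch below illustrates the general idea only: it assumes bilinear interpolation and a real-valued patch centre, which the excerpt does not specify, and the helper names `bilinear_sample` and `crop_patch` are hypothetical rather than taken from the AdaFocusV2 code. Because each sampled pixel is a smooth function of the centre coordinates, a loss on the patch can in principle pass gradients back to whatever predicts the location.

```python
import math


def bilinear_sample(frame, y, x):
    """Bilinearly interpolate a 2-D grid (list of rows) at a real-valued (y, x)."""
    h, w = len(frame), len(frame[0])
    # Clamp the query so the four neighbouring pixels always exist.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0 = min(int(math.floor(y)), h - 2)
    x0 = min(int(math.floor(x)), w - 2)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * frame[y0][x0] + dx * frame[y0][x0 + 1]
    bottom = (1 - dx) * frame[y0 + 1][x0] + dx * frame[y0 + 1][x0 + 1]
    return (1 - dy) * top + dy * bottom


def crop_patch(frame, center_y, center_x, size):
    """Sample a size x size patch centred at a real-valued location."""
    offset = (size - 1) / 2.0
    return [
        [bilinear_sample(frame, center_y - offset + i, center_x - offset + j)
         for j in range(size)]
        for i in range(size)
    ]


# Toy frame whose intensity is 10 * row + col (illustrative data only).
frame = [[10.0 * r + c for c in range(6)] for r in range(6)]
patch = crop_patch(frame, 2.5, 3.25, 2)

# Nudging the centre changes the patch smoothly; a hard integer crop would not.
shifted = crop_patch(frame, 2.5, 3.26, 2)
slope_x = (shifted[0][0] - patch[0][0]) / 0.01
```

In an automatic-differentiation framework the finite difference `slope_x` would be replaced by an exact gradient with respect to the centre coordinates.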

“…Higher IC and lower AD indicate a better explanation. Additionally, we utilize two more general explainability measures, fidelity minus (F−) and fidelity plus (F+) [86], defined as…”

Table fragment captured inside the excerpt:

Method                                      mAP (%)
ST-VLAD [50]                                77.5
PivotCorrNN [53]                            77.6
LiteEval [57]                               80.0
AdaFrame [54]                               80.2
SCSampler [73]                              81.0
AR-Net [7]                                  81.3
SMART [59]                                  82.1
AR-Net (EfficientNet backbone) [7]          84.4
ObjectGraphs [5]                            84.6
AdaFocusV2 [19]                             85.0
ViGAT (proposed; ResNet backbone)           86.0
ViGAT (proposed; ViT backbone)              88.1

Section: Evaluation Measures (mentioning)

confidence: 99%

“…Such an approach can also facilitate the generation of object- and frame-based explanations about the event recognition outcome. An example of this is shown in the second row of the figure. … a large amount of them is irrelevant and does not need to be thoroughly analyzed [15]-[19].…”

Section: Introduction (mentioning)

confidence: 99%

ViGAT: Bottom-Up Event Recognition and Explanation in Video Using Factorized Graph Attention Network

2022

In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video, is proposed. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to capture effectively both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, MiniKinetics, ActivityNet). Source code and trained models will be made available upon acceptance.

Index Terms: Video event recognition, eXplainable AI (XAI), graph attention network, factorized attention, bottom-up.
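The abstract above mentions weighted in-degrees (WiDs) derived from adjacency matrices as a way to pick out salient objects and frames. As a rough illustration, the sketch below assumes that `adjacency[i][j]` holds the attention weight of the edge from node i to node j, so a node's WiD is a column sum; the abstract does not state this convention, and the function names and toy matrix are hypothetical.

```python
def weighted_in_degrees(adjacency):
    """Column sums: total edge weight arriving at each node (assumed convention)."""
    n = len(adjacency)
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def rank_by_salience(adjacency):
    """Node indices ordered from highest to lowest weighted in-degree."""
    wid = weighted_in_degrees(adjacency)
    return sorted(range(len(wid)), key=lambda j: wid[j], reverse=True)


# Toy 3-node attention matrix (illustrative values, rows sum to 1).
adjacency = [
    [0.1, 0.7, 0.2],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
wid = weighted_in_degrees(adjacency)
ranking = rank_by_salience(adjacency)
```

Here node 1 receives the most attention from the other nodes, so under this assumption it would be reported as the most salient object or frame.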
