2019
DOI: 10.1609/aaai.v33i01.33018674
Temporal Bilinear Networks for Video Action Recognition

Abstract: Temporal modeling in videos is a fundamental yet challenging problem in computer vision. In this paper, we propose a novel Temporal Bilinear (TB) model to capture the temporal pairwise feature interactions between adjacent frames. Compared with some existing temporal methods which are limited in linear transformations, our TB model considers explicit quadratic bilinear transformations in the temporal domain for motion evolution and sequential relation modeling. We further leverage the factorized bilinear model…
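Only the abstract is available here, so the following is a minimal sketch of what a factorized temporal bilinear interaction could look like, not the paper's actual implementation. The function name, shapes, and the rank-K factorization W ≈ FᵀF are all assumptions for illustration; the idea shown is the quadratic pairwise interaction between adjacent frame features that the abstract describes.

```python
import numpy as np

def temporal_bilinear(x, F, bias=0.0):
    """Hypothetical sketch: factorized bilinear interaction between
    adjacent frames.

    x:    (T, C) array of per-frame features.
    F:    (K, C) factor matrix approximating the bilinear weight W = F^T F.
    Returns a (T-1,) array of scores y_t = x_t^T W x_{t+1} + bias,
    computed cheaply as (F x_t) . (F x_{t+1}).
    """
    proj = x @ F.T                              # (T, K): project frames into K factors
    return np.sum(proj[:-1] * proj[1:], axis=1) + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))   # 8 frames, 16-dim features (toy sizes)
F = rng.standard_normal((4, 16))   # rank-4 factorization of the C x C weight
y = temporal_bilinear(x, F)
print(y.shape)  # one score per adjacent frame pair: (7,)
```

The factorization is the standard trick for bilinear models: instead of a full C×C weight (quadratic in feature dimension), a K×C factor matrix gives the same pairwise form at O(KC) cost per frame.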

Cited by 41 publications (29 citation statements)
References 27 publications
“…From Fig. 6 we have the following findings for RGB frames: 1) Basic* fails over BESS: 13,19,22,23,24,34,38,42,43,45,46 2,8,12,13,18,19,21,22,25,28,33,34,36,38,40,42,43,44,46,47,51 From Fig. 7 we have the following findings for Flow frames: 1) Basic* fails over BESS: 8,12,17,19,20,24,34,36,38,41,43,45,46 2,5,6,9,…”
Section: Effectiveness of BESS and HAFS
confidence: 97%
“…In recent years, action recognition performance in videos has improved by a large margin through the two-stream network architecture [7]. Inspired by the work in [7], numerous approaches have been proposed using the two-stream architecture as the backbone structure [8]-[10], [23], [26], [28]. In a two-stream network, one stream is dedicated to RGB frames and the other to flow frames, to extract appearance and temporal features respectively.…”
Section: Related Work: RGB and Flow Based Action Recognition
confidence: 99%
“…In the past decades, Deep Neural Network (DNN), as an effective data-driven solution for computer vision tasks, has been exploited to accomplish high-level visual recognition tasks, e.g. image classification [12], action recognition [13,14], and low-level image restoration/enhancement tasks, including super-resolution [15,16,17], low-light enhancement [18,19], rain removal [20,21,22,23], etc. With powerful computing devices like GPUs and TPUs [24], given well-defined inputs and outputs, the network can automatically learn an end-to-end mapping from inputs to outputs.…”
Section: Introduction
confidence: 99%
“…Yet our approach does not utilize optical flow or the two-stream structure during either training or inference, leading to a significant reduction of more than 83.2% and 28.8% in parameters and FLOPs, respectively. For Something-Something v1, while our MF-KPSEM performs slightly worse than S3D [83] and TSM [92], it is a much lighter network with a smaller parameter size and lower computational cost. Specifically, the parameters of our MF-KPSEM are…”
Section: Results and Comparison
confidence: 89%
“…• CNN with learnable feature correlations: TBN [92], Res50-NL [8], Res50-CGD [93], Res50-CGNL [94], I3D-NL [8] and I3D-NL-GCN [95]. The performance results in Table 3.1 show that our network achieves the state-of-the-art result on the Mini-Kinetics with only a minor increase in the number of parameters and required computational cost.…”
Section: Results and Comparison
confidence: 98%