2019
DOI: 10.1109/lsp.2019.2923918
Three-Stream Network With Bidirectional Self-Attention for Action Recognition in Extreme Low Resolution Videos

Cited by 35 publications (9 citation statements)
References 24 publications
“…Therefore, researchers use convolutional networks as the basic architecture for human action recognition [21,22]. For example, Purwanto et al. [23] designed a two-way self-attention network for human action recognition, aggregated with the three-stream network in that work to recognize human actions in ultra-low-resolution video and to expose the temporal dependence between spatial-temporal features. Zhang et al. [24] used a two-stream convolutional neural network for feature extraction and then added a self-attention mechanism to the action recognition framework.…”
Section: CNN-Based Methods of Action Classification Based on Attention Mechanism
Citation type: mentioning; confidence: 99%
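The bidirectional (two-way) self-attention idea referenced in this statement can be illustrated with a short sketch: masked attention is run forward and backward over the temporal axis of per-frame stream features, and the two views are fused. This is a minimal PyTorch illustration under assumed shapes, not the implementation from [23]; the class name and dimensions are hypothetical.

```python
# Minimal sketch (not the authors' code) of bidirectional temporal
# self-attention over per-frame features. All names and dimensions
# here are illustrative assumptions.
import torch
import torch.nn as nn


class BidirectionalSelfAttention(nn.Module):
    """Runs masked self-attention forward and backward over time and
    sums the two views, capturing temporal dependence in both
    directions between spatial-temporal features."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.fwd = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bwd = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) stream features, e.g. pooled I3D maps
        t = x.size(1)
        # Causal mask: position i may attend only to j <= i (forward view)
        causal = torch.triu(torch.ones(t, t), diagonal=1).bool()
        f, _ = self.fwd(x, x, x, attn_mask=causal)
        # Anti-causal mask: position i may attend only to j >= i (backward view)
        anti = torch.tril(torch.ones(t, t), diagonal=-1).bool()
        b, _ = self.bwd(x, x, x, attn_mask=anti)
        return x + f + b  # residual fusion of both temporal directions


if __name__ == "__main__":
    feats = torch.randn(2, 16, 256)  # 2 clips, 16 time steps, 256-d features
    out = BidirectionalSelfAttention(256)(feats)
    print(out.shape)  # torch.Size([2, 16, 256])
```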
“…Afterwards, Transformer layers are trained for a downstream task on those features. With this approach, many works [52], [67], [78], [81], [86], [95], [101], [136], [141], [171] are still able to train the Transformer on small datasets (<10k training samples). Nevertheless, it is common to use medium to large datasets, as in [53], [54], [56], [57], [58], [59], [66], [93], [172].…”
Section: Training Regime
Citation type: mentioning; confidence: 99%
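As a rough sketch of this training regime, the snippet below trains a small Transformer encoder head on cached features from a frozen backbone, so only the Transformer layers receive gradients. The dataset shapes, class count, and hyperparameters are illustrative assumptions, not values from the cited works.

```python
# Minimal sketch of training Transformer layers on frozen,
# pre-extracted features. Shapes and class count are assumptions.
import torch
import torch.nn as nn


class FeatureTransformerHead(nn.Module):
    def __init__(self, dim: int = 512, classes: int = 10, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.cls = nn.Linear(dim, classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, dim) features from a frozen backbone;
        # only the encoder and classifier here are trained.
        h = self.encoder(feats)
        return self.cls(h.mean(dim=1))  # pool tokens, then classify


# One optimization step on cached features (the backbone never runs here)
model = FeatureTransformerHead()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
feats = torch.randn(8, 16, 512)        # pretend these were extracted offline
labels = torch.randint(0, 10, (8,))
opt.zero_grad()
loss = nn.functional.cross_entropy(model(feats), labels)
loss.backward()
opt.step()
```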
“…[15] introduce a novel spatial-temporal multi-head self-attention mechanism, combining super-resolution, knowledge distillation, and attention mechanisms to learn powerful yet general features. Moreover, Purwanto et al. [14] address the visual degradation problem with an additional input source, namely the trajectory. This architecture complements the usual spatial-temporal information with trajectory patterns, capturing features robust to visual distortion.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
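A minimal sketch of the three-stream idea with a trajectory input might look as follows; the stand-in linear encoders and dimensions are assumptions (the actual streams in [14] are I3D networks) and are shown only to make the late-fusion structure concrete.

```python
# Illustrative sketch (assumptions throughout) of fusing a trajectory
# stream with spatial and temporal streams. Encoders are stand-ins,
# not the I3D backbones used in the cited paper.
import torch
import torch.nn as nn


class ThreeStreamFusion(nn.Module):
    """Late fusion of spatial (RGB), temporal (flow), and trajectory
    streams; the trajectory stream carries motion patterns that stay
    usable when eLR frames are visually degraded."""

    def __init__(self, feat_dim: int = 256, classes: int = 11):
        super().__init__()
        # Stand-ins for the per-stream backbones.
        self.spatial = nn.Linear(feat_dim, 128)     # RGB appearance features
        self.temporal = nn.Linear(feat_dim, 128)    # optical-flow motion features
        self.trajectory = nn.Linear(feat_dim, 128)  # distortion-robust trajectory features
        self.cls = nn.Linear(3 * 128, classes)

    def forward(self, rgb_f, flow_f, traj_f):
        z = torch.cat(
            [self.spatial(rgb_f), self.temporal(flow_f), self.trajectory(traj_f)],
            dim=-1,
        )
        return self.cls(z)


if __name__ == "__main__":
    b = 4
    logits = ThreeStreamFusion()(
        torch.randn(b, 256), torch.randn(b, 256), torch.randn(b, 256)
    )
    print(logits.shape)  # torch.Size([4, 11])
```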
“…
Method               Exploited Data   Backbone
MSN                  eLR              Three 3D Conv Streams
Semi-Coupled [6]     eLR              Two 2D Conv Streams
ISR [18]             LR + HR          Custom 2D Conv
S.T.M.H.S.A. [15]    eLR + HR         Double I3D
TSN-BSA [14]         eLR              Three Stream I3D
Fully Coupled [29]   eLR              C3D
Multi-Siamese [17]   eLR + HR         Custom 2D Conv

The reported values are the average accuracies over the three official splits.…”
Section: Architecture
Citation type: mentioning; confidence: 99%