2022
DOI: 10.1371/journal.pone.0265115

STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video

Abstract: Most deep learning-based action recognition models focus only on short-term motion, so they often misjudge actions composed of multiple sub-processes, such as the long jump and high jump. Temporal Segment Networks (TSN) enable the network to capture long-term information in a video, but ignore the fact that unrelated frames or regions in the video can also strongly interfere with action recognition. To solve this problem, a soft attention mechanism is introduced in TSN …
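As a rough illustration of the soft attention idea in the abstract, the following sketch (an assumption-laden example, not the authors' implementation; the module name, feature dimension, and segment count are all hypothetical) replaces TSN's uniform average consensus with a learned weighting of per-segment features.

import torch
import torch.nn as nn

class SoftSegmentAttention(nn.Module):
    """Minimal sketch: weight per-segment features with learned soft attention
    instead of TSN's uniform average consensus, so informative segments
    contribute more to the clip-level descriptor. Illustrative only."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # one relevance score per segment

    def forward(self, seg_feats: torch.Tensor) -> torch.Tensor:
        # seg_feats: (batch, num_segments, feat_dim) backbone features
        weights = torch.softmax(self.scorer(seg_feats), dim=1)  # (B, K, 1)
        return (weights * seg_feats).sum(dim=1)                 # (B, feat_dim)

if __name__ == "__main__":
    # Fuse K=3 segment features of dimension 2048 for a batch of 4 clips.
    fused = SoftSegmentAttention(2048)(torch.randn(4, 3, 2048))
    print(fused.shape)  # torch.Size([4, 2048])

In a full model the fused descriptor would feed the classification head, and the attention weights indicate which segments the network considers relevant.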

Cited by 22 publications (4 citation statements)
References 41 publications

Citation statements (ordered by relevance):
“…Furthermore, our employed boosted CSAA model has demonstrated competitive performance compared with multi‐stream CNN and LSTM‐based models [51, 88, 89]. The method outperforms the attention modules proposed in residual CNN structures using only RGB frames or RGB combined with optical flow [93]. The introduced encoded motion information additionally outperforms various two‐stream‐based methods, which typically stack optical flow data as a separate stream to enhance performance [53].…”
Section: Experiments and Discussion (mentioning)
confidence: 99%
“…For example, when TSN is plugged into GSM [3], an accuracy improvement of 32% is achieved. Furthermore, Yang et al. [4] used TSN with a soft attention mechanism to capture important frames from each segment. Moreover, Zhang et al. [5] used the TSN model as a feature extractor with ResNet101 for efficient behavior recognition of pigs.…”
Section: Multimodal Recognition Methods (mentioning)
confidence: 99%
“…Interpretable spatio-temporal attention [48] used spatial and temporal attention via ConvLSTM. Recent self-attention mechanisms have also been introduced in STA-TSN [49] and GTA [50], as well as in Transformer-based video models [3]. Although some of these methods do not aim at visual explanation, the blurry-map issue remains for videos because temporal modeling, which is useful for classification, may be harmful to capturing sharp spatial attention maps.…”
Section: Related Work (mentioning)
confidence: 99%
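
The last statement above contrasts temporal modeling with sharp spatial attention maps. Purely as an illustration of what a spatial soft attention map is (a generic sketch, not taken from any of the cited models), one common formulation scores every location of a frame-level feature map and normalizes the scores with a softmax:

import torch
import torch.nn as nn

class SpatialSoftAttention(nn.Module):
    """Illustrative spatial attention: a 1x1 convolution scores each location
    of a frame-level feature map, a softmax over all locations turns the
    scores into an attention map, and the map re-weights the features."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        # feat: (batch, channels, H, W) feature map of a single frame
        b, _, h, w = feat.shape
        attn = torch.softmax(self.score(feat).view(b, 1, h * w), dim=-1)
        attn = attn.view(b, 1, h, w)   # attention map, sums to 1 per frame
        return feat * attn, attn       # re-weighted features and the map

if __name__ == "__main__":
    weighted, attn_map = SpatialSoftAttention(512)(torch.randn(2, 512, 7, 7))
    print(weighted.shape, attn_map.shape)  # (2, 512, 7, 7) (2, 1, 7, 7)

When such a map is averaged or smoothed by temporal modeling, the resulting visualization tends to blur, which is the issue the quoted statement points out.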