2019
DOI: 10.1109/tpami.2018.2868668

Temporal Segment Networks for Action Recognition in Videos

Abstract: Deep convolutional networks have achieved great success for image recognition. However, for action recognition in videos, their advantage over traditional methods is not so evident. We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structures with a new segment-based sampling and aggregation module. This unique design enables our TSN to efficiently learn action models by using the w…
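The segment-based sampling and aggregation the abstract describes can be sketched as follows. This is a minimal illustration inferred from the paper's description, not the authors' released code; the ResNet-18 backbone, the name `num_segments` (3 here), and average fusion as the consensus function are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models


def sample_snippet_indices(num_frames, num_segments=3):
    """Split the video into equal segments and draw one random frame
    from each (assumes num_frames >= num_segments)."""
    seg_len = num_frames // num_segments
    offsets = torch.randint(seg_len, (num_segments,))
    return torch.arange(num_segments) * seg_len + offsets


class TSNSketch(nn.Module):
    """Minimal TSN-style model: a shared 2D CNN scores one snippet per
    segment, and the per-snippet class scores are fused by a segmental
    consensus function (average fusion here)."""

    def __init__(self, num_classes, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        self.backbone = models.resnet18(weights=None)  # any 2D CNN works
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, snippets):
        # snippets: (batch, num_segments, C, H, W), one RGB frame per segment
        b, s, c, h, w = snippets.shape
        scores = self.backbone(snippets.view(b * s, c, h, w))
        return scores.view(b, s, -1).mean(dim=1)  # segmental consensus
```

Because the backbone is shared across segments, the cost of covering the whole video grows only with the number of sampled snippets, not with the video length.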

Cited by 689 publications (467 citation statements)
References 68 publications
“…In [34], the famous two-stream architecture is devised by applying two 2D CNN architectures separately on visual frames and stacked optical flows. This two-stream architecture is further extended by exploiting convolutional fusion [5], spatio-temporal attention [24], temporal segment networks [41,42] and convolutional encoding [4,27] for video representation learning. Ng et al [49] highlight the drawback of applying 2D CNNs to video frames: long-term dependencies cannot be captured by the two-stream network.…”
Section: Related Work
confidence: 99%
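For context on the two-stream design referenced in [34]: it pairs a spatial CNN on an RGB frame with a temporal CNN on stacked optical-flow fields and fuses their class scores late. Below is a rough sketch under the usual setup; the ResNet backbone, the 10-field flow stack, and score averaging are assumptions, not details from the quoted text.

```python
import torch.nn as nn
import torchvision.models as models


class TwoStreamSketch(nn.Module):
    """Rough two-stream sketch: a spatial CNN on an RGB frame and a
    temporal CNN on stacked optical-flow fields, fused late by
    averaging their class scores."""

    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        # Spatial stream: standard 2D CNN on a single RGB frame.
        self.spatial = models.resnet18(weights=None)
        self.spatial.fc = nn.Linear(self.spatial.fc.in_features, num_classes)
        # Temporal stream: first conv takes 2 channels (x/y displacement)
        # per flow field in the stack instead of 3 RGB channels.
        self.temporal = models.resnet18(weights=None)
        self.temporal.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        self.temporal.fc = nn.Linear(self.temporal.fc.in_features, num_classes)

    def forward(self, rgb, flow):
        # rgb: (B, 3, H, W); flow: (B, 2 * flow_stack, H, W)
        return (self.spatial(rgb) + self.temporal(flow)) / 2  # late fusion
```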
“…Video action recognition. Without rules for logical reasoning, many approaches employ hand-crafted [19,24,34,43] or deep-learned features [8,9,23,36,44,45] of appearance and motion for action recognition. Recently, researchers have attempted to use semantic-level state changes [1,7,10,25,49,50] for video analysis.…”
Section: Related Work
confidence: 99%
“…Note that there are essential differences between the proposed action reasoning approach and many deep learning based action recognition methods [8,9,23,36,44,45]: (1) Instead of only predicting a single action label, our method outputs multiple action labels with the relevant objects, attributes/relationships and the time of each state transition. (2) Our action models are learned from definitions based on semantic-level state transitions (state detectors are trained on still images), and thus they do not need well-annotated video clips for training.…”
Section: Action Recognition Accuracy
confidence: 99%
“…Recently, the I3D architecture [9] was proposed as an improvement over two-stream networks. In another work, Wang et al [53] proposed Temporal Segment Networks (TSNs) to overcome the long-range temporal limitations of two-stream networks by using temporal sampling. More recently, Choutas et al [12] proposed PoTion representations for human action recognition.…”
Section: Related Work
confidence: 99%