Temporal Segment Networks for Action Recognition in Videos

Wang, Limin; Xiong, Yuanjun; Wang, Zhe; Qiao, Yu; Lin, Dahua; Tang, Xiaoou; Gool, Luc Van

doi:10.1109/tpami.2018.2868668

Cited by 689 publications

(467 citation statements)

References 68 publications

Supporting

Mentioning

464

Contrasting

“…In [34], the famous two-stream architecture is devised by applying two 2D CNN architectures separately on visual frames and stacked optical flows. This two-stream architecture is further extended by exploiting convolutional fusion [5], spatio-temporal attention [24], temporal segment networks [41,42] and convolutional encoding [4,27] for video representation learning. Ng et al. [49] highlight the drawback of performing 2D CNN on video frames, in which long-term dependencies cannot be captured by two-stream network.…”

Section: Related Work (mentioning)

confidence: 99%

Learning Spatio-Temporal Representation With Local and Global Diffusion

Qiu, Yao, Ngo, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

179

Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for visual recognition problems. Nevertheless, the convolutional filters in these networks are local operations while ignoring the large-range dependency. Such drawback becomes even worse particularly for video recognition, since video is an information-intensive media with complex temporal variations. In this paper, we present a novel framework to boost the spatio-temporal representation learning by Local and Global Diffusion (LGD). Specifically, we construct a novel neural network architecture that learns the local and global representations in parallel. The architecture is composed of LGD blocks, where each block updates local and global features by modeling the diffusions between these two representations. Diffusions effectively interact two aspects of information, i.e., localized and holistic, for more powerful way of representation learning. Furthermore, a kernelized classifier is introduced to combine the representations from two aspects for video recognition. Our LGD networks achieve clear improvements on the large-scale Kinetics-400 and Kinetics-600 video classification datasets against the best competitors by 3.5% and 0.7%. We further examine the generalization of both the global and local representations produced by our pretrained LGD networks on four different benchmarks for video action recognition and spatio-temporal action detection tasks. Superior performances over several state-of-the-art techniques on these benchmarks are reported. Code is available at: https://github.com/ZhaofanQiu/local-and-global-diffusion-networks.

“…Video action recognition. Without rules for logical reasoning, many approaches often employ hand-crafted [19,24,34,43] or deep-learned features [8,9,23,36,44,45] of appearance and motion for action recognition. Recently, researchers attempt to use the semantic-level state changes [1,7,10,25,49,50] for video analysis.…”

Section: Related Work (mentioning)

confidence: 99%

“…Note that there are essential differences between the proposed action reasoning approach and many deep learning based action recognition methods [8,9,23,36,44,45]: (1) Instead of only predicting a single action label, our method outputs multiple action labels with relevant objects, attributes/relationships and the time of each state transition. (2) Our action models are learned from semantic-level state transitions based definitions (state detectors are trained on still images), and thus it does not need well-annotated video clips for training.…”

Section: Action Recognition Accuracy (mentioning)

confidence: 99%

“…We compare our method to a representative two-stream (appearance and motion) action recognition algorithm TSN [45], which adopts an end-to-end deep learning scheme that utilizes RGB frames and optical flow as a two-stream input. It is worth mentioning that TSN achieves the state-of-the-art performance 94.9% and 89.6% on the benchmark action recognition dataset UCF 101 [38] and ActivityNet [6], respectively.…”

Section: Action Recognition Accuracy (mentioning)

confidence: 99%

Explainable Video Action Reasoning via Prior Knowledge and State Transitions

Zhuo, Cheng, Zhang, et al. 2019

Proceedings of the 27th ACM International Conference on Multimedia

Human action analysis and understanding in videos is an important and challenging task. Although substantial progress has been made in past years, the explainability of existing methods is still limited. In this work, we propose a novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video state changes. Our method takes advantage of both classical reasoning and modern deep learning approaches. Specifically, prior knowledge is defined as the information of a target video domain, including a set of objects, attributes and relationships in the target video domain, as well as relevant actions defined by the temporal attribute and relationship changes (i.e. state transitions). Given a video sequence, we first generate a scene graph on each frame to represent concerned objects, attributes and relationships. Then those scene graphs are linked by tracking objects across frames to form a spatio-temporal graph (also called video graph), which represents semantic-level video states. Finally, by sequentially examining each state transition in the video graph, our method can detect and explain how those actions are executed with prior knowledge, just like the logical manner of thinking by humans. Compared to previous works, the action reasoning results of our method can be explained by both logical rules and semantic-level observations of video content changes. Besides, the proposed method can be used to detect multiple concurrent actions with detailed information, such as who (particular objects), when (time), where (object locations) and how (what kind of changes). Experiments on a re-annotated dataset CAD-120 show the effectiveness of our method.

“…Recently, I3D architecture [9] was proposed as an improvement of two-stream networks. In another work, Wang et al. [53] have proposed Temporal Segment Networks (TSNs) with the purpose of solving the long-range temporal limitations of two-stream networks by using temporal sampling. More recently, Choutas et al. [12] proposed PoTion representations for human action recognition.…”

Section: Related Work (mentioning)

confidence: 99%

Temporal Accumulative Features for Sign Language Recognition

Kındıroğlu, Özdemir, Akarun 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

In this paper, we propose a set of features called temporal accumulative features (TAF) for representing and recognizing isolated sign language gestures. By incorporating sign language specific constructs to better represent the unique linguistic characteristic of sign language videos, we have devised an efficient and fast SLR method for recognizing isolated sign language gestures. The proposed method is an HSV based accumulative video representation where keyframes based on the linguistic movement-hold model are represented by different colors. We also incorporate hand shape information and using a small scale convolutional neural network, demonstrate that sequential modeling of accumulative features for linguistic subunits improves upon baseline classification results.
