2021
DOI: 10.1109/tcsvt.2020.2984569

A Real-Time Action Representation With Temporal Encoding and Deep Compression

Abstract: Deep neural networks have achieved remarkable success for video-based action recognition. However, most existing approaches cannot be deployed in practice due to their high computational cost. To address this challenge, we propose a new real-time convolutional architecture, called the Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while achieving a high processing speed. Specifically, we propose a residual 3D …
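The abstract's mention of a residual 3D network can be read as stacking 3D convolutions with identity shortcuts. Below is a minimal PyTorch sketch of such a residual 3D block; the channel counts, layer layout, and names are illustrative assumptions, not the authors' exact T-C3D architecture.

import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    """Illustrative residual block built from 3D convolutions (hypothetical sizes)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Learn a residual on top of the identity shortcut.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

# Example input: 2 clips, 64 channels, 8 frames of 56x56 feature maps.
x = torch.randn(2, 64, 8, 56, 56)
print(Residual3DBlock(64)(x).shape)  # torch.Size([2, 64, 8, 56, 56])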

Cited by 35 publications (10 citation statements)
References 45 publications
“…The study in [34] found that video streaming data has a spatiotemporal correlation feature, which makes the video streams between connected regions influence each other. While increasing the spatiotemporal feature extraction of video stream sequences, this paper proposes to enhance the robustness and convergence of the CNN-LSTM prediction model using the RAdam optimization algorithm [24], which reduces the prediction error of the proposed model while also reducing the model training time, and finally obtains good prediction accuracy [35].…”
Section: CNN-LSTM Model (mentioning)
confidence: 99%
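The statement above describes a CNN-LSTM predictor for video-stream sequences trained with the RAdam optimizer. The following is a minimal PyTorch sketch of that kind of setup, assuming a recent PyTorch that ships torch.optim.RAdam; the layer sizes, sequence shape, and loss are illustrative assumptions, not the cited model's actual configuration.

import torch
import torch.nn as nn

class CNNLSTMPredictor(nn.Module):
    """Hypothetical CNN-LSTM: a 1D CNN extracts local patterns, an LSTM models temporal correlation."""
    def __init__(self, in_channels: int = 1, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features); Conv1d expects (batch, channels, time).
        feats = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])  # predict from the last time step

model = CNNLSTMPredictor()
# RAdam (rectified Adam) is used here to stabilise convergence, as in the citation.
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(8, 30, 1)   # 8 sequences, 30 time steps, 1 feature each
y = torch.randn(8, 1)
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()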
“…They have enriched a considerable portion of an existing dataset with spatial and temporal annotations, using spatiotemporal tubes instead of whole-frame video segments. Liu et al. (2020) performed real-time action representation using temporal encoding together with a deep compression network. Their framework extracts video representations in a hierarchical granularity manner and uses a residual 3D CNN to extract appearance-based data.…”
Section: Related Work (mentioning)
confidence: 99%
“…T-C3D, a framework proposed by Liu et al. [26], first divides the given input video into three clips and samples eight frames from each clip. Thus, it can capture short-term features from frames within the same clip using 3D kernels, and long-term features when fusing the predictions from each clip.…”
Section: 3D CNN (mentioning)
confidence: 99%
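The clip-level scheme described above (three clips per video, eight sampled frames per clip, clip predictions fused into a video-level score) can be sketched as follows. The backbone is a stand-in 3D CNN and the uniform frame sampling is an assumption; this is not the actual T-C3D network or its exact sampling strategy.

import torch
import torch.nn as nn

NUM_CLIPS, FRAMES_PER_CLIP = 3, 8

def sample_clips(video: torch.Tensor) -> torch.Tensor:
    """video: (C, T, H, W) -> (NUM_CLIPS, C, FRAMES_PER_CLIP, H, W)."""
    c, t, h, w = video.shape
    clips = []
    for i in range(NUM_CLIPS):
        start, end = i * t // NUM_CLIPS, (i + 1) * t // NUM_CLIPS
        # Uniformly sample FRAMES_PER_CLIP frame indices within this clip's span.
        idx = torch.linspace(start, end - 1, FRAMES_PER_CLIP).long()
        clips.append(video[:, idx])
    return torch.stack(clips)

backbone = nn.Sequential(            # stand-in 3D CNN producing class scores
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 101),  # e.g. 101 action classes
)

video = torch.randn(3, 96, 112, 112)          # RGB video with 96 frames
clip_scores = backbone(sample_clips(video))   # (3, 101): one score vector per clip
video_score = clip_scores.mean(dim=0)         # long-term fusion across clips
print(video_score.shape)                      # torch.Size([101])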