2018
DOI: 10.1109/access.2018.2814075
Natural Language Description of Video Streams Using Task-Specific Feature Encoding

Cited by 15 publications (9 citation statements) · References 10 publications
“…However, it still required incorporating carefully handled action recognition techniques to outperform the state of the art on action recognition. Scores for the close-up features were comparable to our earlier experiment [33] but higher than those of traditional hand-engineered techniques. However, multi-task learning with basic fine-tuning had a positive impact on the overall results for video description generation.…”
Section: Discussion (supporting)
confidence: 71%
“…In the TRECViD dataset, the scene-based categories of indoor/outdoor, meeting, groups, and traffic had the highest scores, owing to the VGG network's superior capacity for learning scene and object settings. The activity category also saw a 12% gain in performance compared with our previous experiment [33]. However, it still required incorporating carefully handled action recognition techniques to outperform the state of the art on action recognition.…”
Section: Discussion (contrasting)
confidence: 44%
“…The study shows that the optical flow can still be improved by using a more advanced CNN. The proposed model can be applied to video description tasks with the help of natural language description methods [22]. It can also be used for smart-city surveillance, such as unforeseeable event detection and traffic control [23].…”
Section: Discussion (mentioning)
confidence: 99%
“…They also addressed image registration. Aniqa et al. [16] proposed a framework that extracts visual features from video frames using a Convolutional Neural Network (CNN). The framework then passes the derived representations to an LSTM model.…”
Section: Literature Review (mentioning)
confidence: 99%
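The CNN-to-LSTM pipeline described in the excerpt above can be sketched as follows. This is a minimal illustration only: the feature dimension, hidden size, random frame features, and untrained weights are all hypothetical stand-ins, not the cited framework's actual implementation (which uses a real CNN to extract the per-frame features).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. x: frame feature; (h, c): recurrent state.
    W has shape (4*hidden, input+hidden); b has shape (4*hidden,)."""
    hidden = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
feat_dim, hidden = 512, 256                       # hypothetical sizes
frames = rng.standard_normal((16, feat_dim))      # stand-in for CNN frame features
W = rng.standard_normal((4 * hidden, feat_dim + hidden)) * 0.01
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for x in frames:                                  # encode the frame sequence
    h, c = lstm_step(x, h, c, W, b)

print(h.shape)  # final video encoding passed on to the description decoder
```

In the cited framework, the final hidden state (or the full state sequence) would condition a language model that emits the natural-language description word by word; here the loop only shows the sequence-encoding half of that pipeline.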