Natural Language Description of Video Streams Using Task-Specific Feature Encoding

Dilawari, Aniqa; Khan, Muhammad Usman Ghani; Farooq, Ammarah; Rehman, Zahoor Ur; Rho, Seungmin; Mehmood, Irfan

doi:10.1109/access.2018.2814075

Cited by 15 publications

(9 citation statements)

References 10 publications

“…However, it still required incorporating carefully handled action recognition techniques to outperform the state of the art for action. Scores for the close-up features were comparable to our former experiment [33] but high compared with traditional hand-engineered techniques. However, multitask learning with basic fine-tuning had a positive impact on the overall results for video description generation.…”

Section: Discussion (supporting)

confidence: 71%

“…In the TRECViD dataset, the scene-based categories of indoor or outdoor, meeting, groups and traffic had the highest scores due to the superior learning capability of the VGG network for scene and object settings. The activity category also saw a gain in performance compared with our previous experiment [33] of 12%. However, it still required incorporating carefully handled action recognition techniques to outperform the state of the art for action.…”

Section: Discussion (contrasting)

confidence: 44%

Natural Language Description of Videos for Smart Surveillance

et al. 2021

Self Cite

After the September 11 attacks, security and surveillance measures have changed across the globe. Now, surveillance cameras are installed almost everywhere to monitor video footage. Though quite handy, these cameras produce videos in a massive size and volume. The major challenge faced by security agencies is the effort of analyzing the surveillance video data collected and generated daily. Problems related to these videos are twofold: (1) understanding the contents of video streams, and (2) conversion of the video contents to condensed formats, such as textual interpretations and summaries, to save storage space. In this paper, we have proposed a video description framework on a surveillance dataset. This framework is based on the multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks. For each specific task, a parallel pipeline is derived from the base visual geometry group (VGG)-16 model. Tasks include scene recognition, action recognition, object recognition and human face specific feature recognition. Experimental results on the TRECViD, UET Video Surveillance (UETVS) and AGRIINTRUSION datasets depict that the model outperforms state-of-the-art methods by a METEOR (Metric for Evaluation of Translation with Explicit ORdering) score of 33.9%, 34.3%, and 31.2%, respectively. Our results show that our framework has distinct advantages over traditional rule-based models for the recognition and generation of natural language descriptions.

“…The study shows that the optical flow can still be improved if we use advanced CNN. The proposed model can be applied in video description tasks by the help of natural language description methods [22]. The proposed model can be used for smart city surveillance such as the unforseeable event detection and traffic control [23].…”

Section: Discussion (mentioning)

confidence: 99%

Improved two-stream model for human action recognition

Zhao, Man, Smith, et al. 2020

J Image Video Proc.

This paper addresses the recognition of human actions in videos. Human action recognition can be seen as the automatic labeling of a video according to the actions occurring in it. It has become one of the most challenging and attractive problems in the pattern recognition and video classification fields. The problem itself is difficult to solve by traditional video processing methods because of several challenges such as the background noise, sizes of subjects in different videos, and the speed of actions. Derived from the progress of deep learning methods, several directions are developed to recognize a human action from a video, such as the long short-term memory (LSTM)-based model, two-stream convolutional neural network (CNN) model, and the convolutional 3D model. In this paper, we focus on the two-stream structure. The traditional two-stream CNN network solves the problem that CNNs do not have satisfactory performance on temporal features. By training a temporal stream, which uses the optical flow as the input, a CNN can have the ability to extract temporal features. However, the optical flow only contains limited temporal information because it only records the movements of pixels on the x-axis and the y-axis. Therefore, we attempt to design and implement a new two-stream model by using an LSTM-based model in its spatial stream to extract both spatial and temporal features in RGB frames. In addition, we implement a DenseNet in the temporal stream to improve the recognition accuracy. This is in contrast to traditional approaches, which typically utilize the spatial stream for extracting only spatial features. The quantitative evaluation and experiments are conducted on the UCF-101 dataset, which is a well-developed public video dataset. For the temporal stream, we choose the optical flow of UCF-101. Images in the optical flow are provided by the Graz University of Technology. The experimental result shows that the proposed method outperforms the traditional two-stream CNN method by at least 3% in accuracy. For both spatial and temporal streams, the proposed model also achieves higher recognition accuracies. In addition, compared with the state-of-the-art methods, the new model can still have the best recognition performance.

“…They also addressed image registration. Aniqa et al [16] proposed a framework which works by extracting the visual-based features from the frames of video by employing "Convolutional Neural Networks" (CNN). Furthermore, the framework passed the derived representations to the LSTM model.…”

Section: Literature Review (mentioning)

confidence: 99%

Detecting Video Surveillance Using VGG19 Convolutional Neural Networks

Butt, Letchmunan, Hafinaz, et al. 2020

IJACSA

The meteoric growth of data over the internet over the last few years has created a challenge of mining and extracting useful patterns from a large dataset. In recent years, the growth of digital libraries and video databases makes it more challenging and important to extract useful information from raw data to prevent and detect crimes from the database automatically. Street crime snatching and theft detection is the major challenge in video mining. The main target is to select features/objects that usually occur at the time of snatching. The number of moving targets imitates the performance, speed and amount of motion in the anomalous video. The dataset used in this paper is Snatch 101; the videos in the dataset are further divided into frames. The frames are labelled and segmented for training. We applied the VGG19 Convolutional Neural Network architecture algorithm and extracted the features of objects and compared them with original video features and objects. The main contribution of our research is to create frames from the videos and then label the objects. The objects are selected from frames where we can detect anomalous activities. The proposed system has never been used before for crime prediction, and it is computationally efficient and effective as compared to state-of-the-art systems. The proposed system achieved 81% accuracy, outperforming state-of-the-art systems.
