DLSTM approach to video modeling with hashing for large-scale video retrieval

Zhuang, Naifan; Ye, Jun; Hua, Kien A.

doi:10.1109/icpr.2016.7900131

Cited by 9 publications

(4 citation statements)

References 17 publications

“…There have been several methods that adapted the use of 2D CNNs along with a sequential data processing NN layer(s) in addition to additional losses to obtain a video hashing deep model that hashes in an end-to-end manner [11, 14, 15, 26, 27, 28, 29, 30, 31]. The additional sequential data processing NNs are used to obtain temporal features that are not extracted from the CNNs.…”

Section: Content Based Video Retrieval (mentioning)

confidence: 99%

Deep Video Hashing Using 3DCNN with BERT

2022

IJIES

Deep video hashing (DVH) is an appealing way to decrease storage costs and query times. In this work we propose a hashing model built from two separate modules. A 3DCNN with a bidirectional encoder representations from transformers (BERT) layer extracts features, and a hashing neural network (NN) module learns to encode those features into hash codes. Separating feature extraction from hash generation results in better performance with respect to training time and accuracy. We achieve a significant improvement in video retrieval performance on two benchmark datasets compared to state-of-the-art deep learning models for video retrieval that use convolutional neural networks (CNNs) or 3DCNNs along with other temporal feature extraction techniques and supervised hashing methods. For the UCF101 and HMDB51 datasets, mAP improvements of more than 2% and 24%, respectively, are achieved for the tested bit sizes.
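The abstract above motivates hashing by its effect on storage and query time. As a rough illustration of why binary codes are cheap to compare, the following minimal sketch ranks a toy database by Hamming distance. It is not taken from any of the cited papers: the function names (`pack_bits`, `hamming_distance`, `rank_by_hamming`), the 8-bit codes, and the video names are all assumptions made up for this example.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into a single Python integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def hamming_distance(code_a, code_b):
    """Count the bit positions in which two packed codes differ."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code, database):
    """Return database keys sorted by Hamming distance to the query.

    Ties are broken by key so the ordering is deterministic.
    """
    return sorted(
        database,
        key=lambda name: (hamming_distance(query_code, database[name]), name),
    )


# Hypothetical 8-bit codes; real systems typically use longer codes.
database = {
    "video_a": pack_bits([1, 0, 1, 1, 0, 0, 1, 0]),
    "video_b": pack_bits([1, 0, 1, 0, 0, 0, 1, 0]),
    "video_c": pack_bits([0, 1, 0, 0, 1, 1, 0, 1]),
}
query = pack_bits([1, 0, 1, 1, 0, 0, 1, 1])
ranking = rank_by_hamming(query, database)
```

Each comparison is one XOR plus a bit count, and each video is stored as a handful of bits, which is the source of the storage and query-time savings the abstract refers to.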

“…In [26] after extracting features using VGG19 an attention-based LSTM is used to further process the features then a fully connected (FC) layer to get the hashes. [27] uses differential LSTM (DLSTM) along with a variation of AlexNet to encode the features into hashes.…”

Section: Content Based Video Retrieval (mentioning)

confidence: 99%

Deep Video Hashing Using 3DCNN with BERT

2022

IJIES

“…Long Short-Term Memory (LSTM) network, a typical type of recurrent neural network (RNN) architecture, was proposed by Hochreiter et al. [16] and is widely used in many research tasks. Taking video as an example, this network has been applied in action recognition [17] [18] [19], video retrieval [20] [21], video segmentation [22] [23], video captioning [24] [25], etc. LSTM-Autoencoder, a typical sequence-to-sequence [26] framework, was proposed by Srivastava et al. [17] and applied to learning video action recognition.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Anomaly Video Detection via a Double-Flow ConvLSTM Variational Autoencoder

et al. 2022

With the rapid increase of video surveillance points in the market in recent years, video anomaly detection has gained extensive attention in the security field. At present, the distribution of normal and anomalous data is unbalanced in unlabeled video data. Variational autoencoder (VAE), as one of the typical deep generative models, gets increasingly popular in unsupervised anomaly detection. However, this model is not good at processing time-series data, especially video data. In addition, the strong generalization ability which is over-reconstructing anomaly behavior of many autoencoder-based works leads to the missed anomaly detection. To solve these problems, in this paper, we present a double-flow convolutional long short-term memory variational autoencoder (DF-ConvLSTM-VAE) to model the probabilistic distribution of the normal video in an unsupervised learning scheme, and to reconstruct videos without anomaly objects for anomaly video detection. Experiments verify the effectiveness and competitiveness of our DF-ConvLSTM-VAE on multiple public benchmark datasets. In particular, our model achieves the state-of-the-art performance on anomalous event count.

“…Raw frame representations obtained from a CNN were fed to an LSTM, max-pooling, and fully connected layer to attain the fixed-length hash codes. For reducing the feature size to support massive video databases, Zhuang et al. [82] proposed using a differential LSTM (DLSTM) [83] for modeling videos. They extract one video segment to generate a highly compact fixed-length representation of the original video.…”

Section: Image Retrieval (mentioning)

confidence: 99%

A system for large-scale image and video retrieval on everyday scenes

Zachariah¹

There has been a growing amount of multimedia data generated on the web today in terms of size and diversity. This has made accurate content retrieval with these large and complex collections of data a challenging problem. Motivated by the need for systems that can enable scalable and efficient search, we propose QIK (Querying Images Using Contextual Knowledge). QIK leverages advances in deep learning (DL) and natural language processing (NLP) for scene understanding to enable large-scale multimedia retrieval on everyday scenes with common objects. The system consists of three major components: Indexer, Query Processor, and Video Processor. Given an image, the Indexer performs probabilistic image understanding (PIU). The PIU generated consists of the most probable captions, parsed and represented by tree structures using NLP techniques, and detected objects. The PIUs are stored and indexed in a database system. For a query image, the Query Processor generates the most probable caption and parses it into the corresponding tree structure. Then an optimized tree-pattern query is constructed and executed on the database to retrieve a set of candidate images. The candidate images fetched are ranked using the tree-edit distance metric computed on the tree structures. Given a video, the Video Processor extracts a sequence of key scenes that are posed to the Query Processor to retrieve a set of candidate scenes. The candidate scene parse trees corresponding to a video are extracted and are ranked based on the number of matching scenes. We evaluated the performance of our system for large-scale image and video retrieval tasks on datasets containing everyday scenes and observed that our system could outperform state-of-the-art techniques in terms of mean average precision.
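The citation statement from this paper describes a pipeline in which per-frame features pass through max-pooling and a fully connected layer to yield fixed-length hash codes. The sketch below illustrates only those last stages, and only under stated assumptions: the LSTM (or DLSTM) stage is omitted, the weights are random rather than learned, thresholding at zero is one common binarization choice and is not confirmed by the source, and every name (`max_pool_over_time`, `project`, `binarize`, `video_to_code`) is hypothetical.

```python
import random


def max_pool_over_time(frame_features):
    """Collapse a sequence of per-frame vectors into one vector (element-wise max)."""
    return [max(column) for column in zip(*frame_features)]


def project(vector, weights):
    """Bias-free fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def binarize(vector):
    """Threshold at zero to obtain a fixed-length bit code (an assumed choice)."""
    return [1 if v > 0 else 0 for v in vector]


def video_to_code(frame_features, weights):
    """Map a variable-length sequence of frame features to a fixed-length code."""
    return binarize(project(max_pool_over_time(frame_features), weights))


# Random stand-ins for CNN frame features and for learned layer weights.
rng = random.Random(0)
feature_dim, code_bits = 8, 16
weights = [[rng.uniform(-1, 1) for _ in range(feature_dim)] for _ in range(code_bits)]
short_video = [[rng.uniform(-1, 1) for _ in range(feature_dim)] for _ in range(5)]
long_video = [[rng.uniform(-1, 1) for _ in range(feature_dim)] for _ in range(40)]

short_code = video_to_code(short_video, weights)
long_code = video_to_code(long_video, weights)
```

The point of the sketch is the shape of the result: a 5-frame video and a 40-frame video both map to a code of `code_bits` bits, which is what makes the representation fixed-length regardless of video duration.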
