The appearance gap between sketches and photo-realistic images is a fundamental challenge in sketchbased image retrieval (SBIR) systems. The existence of noisy edges on photo-realistic images is a key factor in the enlargement of the appearance gap and significantly degrades retrieval performance. To bridge the gap, we propose a framework consisting of a new line segment-based descriptor named histogram of line relationship (HLR) and a new noise impact reduction algorithm known as object boundary selection. HLR treats sketches and extracted edges of photo-realistic images as a series of piece-wise line segments and captures the relationship between them. Based on the HLR, the object boundary selection algorithm aims to reduce the impact of noisy edges by selecting the shaping edges that best correspond to the object boundaries. Multiple hypotheses are generated for descriptors by hypothetical edge selection. The selection algorithm is formulated to find the best combination of hypotheses to maximize the retrieval score; a fast method is also proposed. To reduce the distraction of false matches in the scoring process, two constraints on spatial and coherent aspects are introduced. We tested the HLR descriptor and the proposed framework on public datasets and a new image dataset of three million images, which we recently collected for SBIR evaluation purposes. We compared the proposed HLR with state-of-the-art descriptors (SHoG, GF-HOG). The experimental results show that our HLR descriptor outperforms them. Combined with the object boundary selection algorithm, our framework significantly improves SBIR performance.Index Terms-Large-scale sketch retrieval, line segment-based descriptor, object boundary selection.
Action recognition has already been a heated research topic recently, which attempts to classify different human actions in videos. The current main-stream methods generally utilize ImageNet-pretrained model as features extractor, however it’s not the optimal choice to pretrain a model for classifying videos on a huge still image dataset. What’s more, very few works notice that 3D convolution neural network(3D CNN) is better for low-level spatial-temporal features extraction while recurrent neural network(RNN) is better for modelling high-level temporal feature sequences. Consequently, a novel model is proposed in our work to address the two problems mentioned above. First, we pretrain 3D CNN model on huge video action recognition dataset Kinetics to improve generality of the model. And then long short term memory(LSTM) is introduced to model the high-level temporal features produced by the Kinetics-pretrained 3D CNN model. Our experiments results show that the Kinetics-pretrained model can generally outperform ImageNet-pretrained model. And our proposed network finally achieve leading performance on UCF-101 dataset.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.