Semantic Pooling for Complex Event Analysis in Untrimmed Videos

Chang, Xiaojun; Yu, Yaoliang; Yang, Yi; Xing, Eric P.

doi:10.1109/tpami.2016.2608901

Cited by 313 publications

(77 citation statements)

References 60 publications

(63 reference statements)

“…Most of the traffic sign detection methods are focused on images. Another focus of study can be to analyze traffic videos for traffic sign detection by leveraging the semantic representations [20,21]. Yet another focus could consider mining the correlations between the features of traffic signs by a semi-supervised feature selection framework [22] in traffic videos.…”

Section: Related Work (mentioning)

confidence: 99%

A Real-Time Chinese Traffic Sign Detection Algorithm Based on Modified YOLOv2

et al. 2017

Abstract: Traffic sign detection is an important task in traffic sign recognition systems. Chinese traffic signs have their unique features compared with traffic signs of other countries. Convolutional neural networks (CNNs) have achieved a breakthrough in computer vision tasks and made great success in traffic sign classification. In this paper, we present a Chinese traffic sign detection algorithm based on a deep convolutional network. To achieve real-time Chinese traffic sign detection, we propose an end-to-end convolutional network inspired by YOLOv2. In view of the characteristics of traffic signs, we take the multiple 1 × 1 convolutional layers in intermediate layers of the network and decrease the convolutional layers in top layers to reduce the computational complexity. For effectively detecting small traffic signs, we divide the input images into dense grids to obtain finer feature maps. Moreover, we expand the Chinese traffic sign dataset (CTSD) and improve the marker information, which is available online. All experimental results evaluated according to our expanded CTSD and German Traffic Sign Detection Benchmark (GTSDB) indicate that the proposed method is faster and more robust. The fastest detection speed achieved was 0.017 s per image.

“…For general poolings, three popular approaches were surveyed, i.e., sum pooling [66][67][68][69][70], average pooling [71][72][73][74][75][76], and max pooling [77][78][79][80]. For particular poolings, another three popular approaches were surveyed, i.e., stochastic pooling [81], semantic pooling [82], and multi-scale pooling [83][84][85][86].…”

Section: Feature Encoding and Pooling Taxonomy (mentioning)

confidence: 99%

“…For complex event detection in long internet videos with few relevant shots, traditional pooling strategies treat usually each shot equally and cannot aggregate the shots based on their relevance with respect to the event of interest [82]. Chang et al [82] proposed a semantic pooling approach to prioritize CNN shot outputs according to their semantic saliencies.…”

Section: Semantic Pooling (mentioning)

confidence: 99%

Feature Encodings and Poolings for Action and Event Recognition: A Comprehensive Survey

Liu, Zhang, et al. 2017

Information

Action and event recognition in multimedia collections is relevant to progress in cross-disciplinary research areas including computer vision, computational optimization, statistical learning, and nonlinear dynamics. Over the past two decades, action and event recognition has evolved from earlier intervening strategies under controlled environments to recent automatic solutions under dynamic environments, resulting in an imperative requirement to effectively organize spatiotemporal deep features. Consequently, resorting to feature encodings and poolings for action and event recognition in complex multimedia collections is an inevitable trend. The purpose of this paper is to offer a comprehensive survey on the most popular feature encoding and pooling approaches in action and event recognition in recent years by summarizing systematically both underlying theoretical principles and original experimental conclusions of those approaches based on an approach-based taxonomy, so as to provide impetus for future relevant studies.

“…As a result, cross-modal retrieval attracts increasing attention and plays an important role to describe the content of an image with natural language and conversely retrieve image given textual query Pereira and Vasconcelos (2014); Amir et al (2004); Chang et al (2017a). However, since data in diverse modalities are presented in heterogeneous feature spaces and usually have varying statistical properties, it is a significant challenge to bridge the heterogeneity-gap between multi-modal data Grangier and Bengio (2008); Ranjan et al (2015).…”

Section: Introduction (mentioning)

confidence: 99%

Simple to complex cross-modal learning to rank

Luo, Chang, et al. 2017

Computer Vision and Image Understanding

Self Cite

The heterogeneity gap between different modalities brings a significant challenge to multimedia information retrieval. Some studies formalize the cross-modal retrieval tasks as a ranking problem and learn a shared multi-modal embedding space to measure the cross-modality similarity. However, previous methods often establish the shared embedding space based on linear mapping functions, which might not be sophisticated enough to reveal more complicated inter-modal correspondences. Additionally, current studies assume that the rankings are of equal importance, and thus all rankings are used simultaneously, or a small number of rankings are selected randomly to train the embedding space at each iteration. Such strategies, however, always suffer from outliers as well as reduced generalization capability because they do not take the procedure of human cognition into account. In this paper, we incorporate self-paced learning theory with diversity into cross-modal learning to rank and learn an optimal multi-modal embedding space based on non-linear mapping functions. This strategy enhances the model's robustness to outliers and achieves better generalization by training the model gradually from easy rankings by diverse queries to more complex ones. An efficient alternating algorithm is exploited to solve the proposed challenging problem with fast convergence in practice. Extensive experimental results on several benchmark datasets indicate that the proposed method achieves significant improvements over the state of the art in the literature.
