Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes. Second, train a model utilizing this data. Toward the goal of solving fine-grained recognition, we introduce an alternative approach, leveraging free, noisy data from the web and simple, generic methods of recognition. This approach has benefits in both performance and scalability. We demonstrate its efficacy on four fine-grained datasets, greatly exceeding existing state of the art without the manual collection of even a single label, and furthermore show first results at scaling to more than 10,000 fine-grained categories. Quantitatively, we achieve top-1 accuracies of 92.3% on CUB-200-2011, 85.4% on Birdsnap, 93.4% on FGVC-Aircraft, and 80.8% on Stanford Dogs without using their annotated training sets. We compare our approach to an active learning approach for expanding fine-grained datasets.
This paper presents a method for genre categorization of movie trailers, based on scene categorization. We view our approach as a step forward from using only low-level visual feature cues, towards the eventual goal of high-level semantic understanding of feature films. Our approach decomposes each trailer into a collection of keyframes through shot boundary analysis. From these keyframes, we use state-of-the-art scene detectors and descriptors to extract features, which are then used for shot categorization via unsupervised learning. This allows us to represent trailers using a bag-of-visual-words (bovw) model with shot classes as vocabularies. We approach the genre classification task by mapping temporally structured bovw trailer features to four high-level movie genres: action, comedy, drama or horror films. We have conducted experiments on 1239 annotated trailers. Our experimental results demonstrate that exploiting scene structures improves film genre classification compared to using only low-level visual features.
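The representation described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes keyframe features are already extracted as small numeric vectors (toy 2-D tuples here), uses plain k-means as a stand-in for the unspecified unsupervised learning step, and ignores the temporal structure the abstract mentions. The function names `kmeans`, `shot_class`, and `trailer_histogram` are hypothetical.

```python
from collections import Counter
from math import dist


def kmeans(points, k, iters=20):
    """Lloyd's k-means. Assumption: centroids are seeded with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                mean = tuple(sum(v) / len(members) for v in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def shot_class(feature, centroids):
    """Index of the nearest centroid, i.e. the 'visual word' for one keyframe."""
    return min(range(len(centroids)), key=lambda c: dist(feature, centroids[c]))


def trailer_histogram(keyframe_features, centroids):
    """Normalized count of shot classes over one trailer's keyframes."""
    if not keyframe_features:
        return [0.0] * len(centroids)
    counts = Counter(shot_class(f, centroids) for f in keyframe_features)
    total = len(keyframe_features)
    return [counts[c] / total for c in range(len(centroids))]


# Toy data: keyframe features pooled from all trailers (made up for illustration).
pooled = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
vocabulary = kmeans(pooled, k=2)

# One trailer with four keyframes, three of them near the first shot class.
trailer = [(0.1, 0.1), (0.0, 0.0), (5.0, 5.0), (0.2, 0.1)]
histogram = trailer_histogram(trailer, vocabulary)
```

The resulting histogram is a fixed-length vector per trailer, which a standard classifier could then map to the four genre labels.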
Most deep architectures for image classification, even those that are trained to classify a large number of diverse categories, learn shared image representations with a single model. Intuitively, however, categories that are more similar should share more information than those that are very different. While hierarchical deep networks address this problem by learning separate features for subsets of related categories, current implementations require simplified models using fixed architectures specified via heuristic clustering methods. Instead, we propose Blockout, a method for regularization and model selection that simultaneously learns both the model architecture and parameters. A generalization of Dropout, our approach gives a novel parametrization of hierarchical architectures that allows for structure learning via back-propagation. To demonstrate its utility, we evaluate Blockout on the CIFAR and ImageNet datasets, demonstrating improved classification accuracy, better regularization performance, faster training, and the clear emergence of hierarchical network structures.