Tinne Tuytelaars†, ESAT-PSI-VISICS, iMinds MMT, Katholieke Universiteit Leuven, Belgium

Figure 1: Sketches are a high-level representation that does not always convey enough information to distinguish between different categories of objects. Ambiguous sketches, such as the ones above, pose a challenge in evaluating computer methods for sketch recognition.

Abstract

We introduce an approach for sketch classification based on Fisher vectors that significantly outperforms existing techniques. On the TU-Berlin sketch benchmark [Eitz et al. 2012a], our recognition rate is close to human performance on the same task. Motivated by these results, we propose a different benchmark for evaluating sketch classification algorithms. Our key idea is that the relevant aspect when recognizing a sketch is not the intention of the person who made the drawing, but the information that was effectively expressed. We modify the original benchmark to capture this concept more precisely and thus provide a more adequate tool for evaluating sketch classification techniques. Finally, we perform a classification-driven analysis that recovers semantic aspects of the individual sketches, such as the quality of the drawing and the importance of each part of the sketch for recognition.
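As a rough illustration of the encoding step, the following is a minimal Fisher-vector sketch in Python, assuming local descriptors have already been extracted from each drawing. The descriptor dimension, GMM size, and normalization choices are illustrative assumptions, not the paper's exact settings.

```python
# Minimal Fisher-vector encoding sketch. Assumes local descriptors
# (e.g. SIFT-like features sampled along the strokes) are given;
# sizes below are placeholders, not the paper's configuration.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode a set of local descriptors (T x D) as a Fisher vector."""
    T, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)        # (T, K) soft assignments
    mu, w = gmm.means_, gmm.weights_              # (K, D), (K,)
    sigma = np.sqrt(gmm.covariances_)             # diagonal covariances -> (K, D)

    fv = []
    for k in range(gmm.n_components):
        z = (descriptors - mu[k]) / sigma[k]      # normalized deviations
        g_mu = (gamma[:, k, None] * z).sum(0) / (T * np.sqrt(w[k]))
        g_sig = (gamma[:, k, None] * (z ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w[k]))
        fv.extend([g_mu, g_sig])
    fv = np.concatenate(fv)

    # Power- and L2-normalization, as is standard for Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Usage sketch: fit the GMM on descriptors pooled over the training set,
# then feed the resulting vectors to a linear classifier (e.g. an SVM).
train_desc = np.random.randn(5000, 64)            # placeholder descriptors
gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(train_desc)
fv = fisher_vector(np.random.randn(300, 64), gmm)
```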
We introduce a new approach for segmentation and label transfer in sketches that substantially improves on the state of the art. We build on successful techniques to estimate how likely each segment is to belong to each label, and use a Conditional Random Field to find the most probable global configuration. Our method is trained entirely in the sketch domain, so it can handle abstract sketches that are very far from 3D meshes. It also requires only a small amount of annotated data, which makes it easily adaptable to new datasets. The testing phase is completely automatic, and our performance is comparable to state-of-the-art methods that require manual tuning and a considerable amount of prior annotation [Huang et al. 2014].
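The following is a minimal sketch of the global labeling step, assuming per-segment label probabilities (from any segment classifier) and a segment adjacency graph are already available. The Potts pairwise term and the iterated-conditional-modes optimizer here are illustrative stand-ins; the paper's actual potentials and inference procedure may differ.

```python
# Minimal pairwise-CRF labeling sketch over sketch segments.
# unary_prob and edges are assumed inputs, not the paper's exact model.
import numpy as np

def crf_label(unary_prob, edges, smoothness=1.0, iters=10):
    """unary_prob: (S, L) per-segment label probabilities.
    edges: list of (i, j) pairs of adjacent segments."""
    unary = -np.log(unary_prob + 1e-12)          # unary potentials
    labels = unary.argmin(1)                     # independent initialization
    neighbors = {s: [] for s in range(len(unary))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    for _ in range(iters):                       # iterated conditional modes
        changed = False
        for s in range(len(unary)):
            # Potts pairwise term: penalize disagreeing with each neighbor.
            cost = unary[s].copy()
            for n in neighbors[s]:
                cost += smoothness * (np.arange(unary.shape[1]) != labels[n])
            best = cost.argmin()
            changed |= best != labels[s]
            labels[s] = best
        if not changed:
            break
    return labels
```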
Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
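As a reference point for the loss comparison, the following is a minimal CLIP-style image-text contrastive (InfoNCE) loss in PyTorch; the batch size, embedding dimension, and temperature are illustrative assumptions, not the settings studied in the paper.

```python
# Minimal symmetric image-text contrastive loss sketch.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of B matched image-text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(img.size(0))          # matched pairs on the diagonal
    # Symmetric cross-entropy over image->text and text->image retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```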