Conventional image annotation systems can only handle images whose labels fall within an existing label library and cannot recognize novel labels. To learn new concepts, one has to gather a large amount of labeled images and retrain the model from scratch; moreover, collecting such labeled images can be expensive. For these reasons, we put forward a zero-shot image annotation model to reduce the demand for images with novel labels. In this paper, we focus on two major challenges of zero-shot image annotation: polysemous words and the strong bias in the generalized zero-shot setting. For the first problem, instead of training on large text corpora as previous methods do, we propose to adopt Node2Vec to obtain contextualized word embeddings, which can readily produce word vectors for polysemous words. For the second problem, we alleviate the strong bias in two ways: on one hand, we utilize a model based on a graph convolutional network (GCN) so that target images are involved in the training process; on the other hand, we put forward a novel semantic coherent (SC) loss to capture the semantic relations between source and target labels. Extensive experiments on the NUS-WIDE, COCO, IAPR TC-12, and Corel5k datasets show the superiority of the proposed model, with annotation performance improved by 4%-6% compared with state-of-the-art methods.
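The abstract names Node2Vec as the source of the label embeddings but does not describe the underlying label graph or its hyperparameters. As a minimal sketch, assuming the labels are connected by a co-occurrence graph (an assumption; the paper may construct the graph differently), the snippet below shows how Node2Vec-style graph embeddings give a polysemous label such as "apple" a vector shaped by both of its neighborhoods:

```python
# Hypothetical sketch: label embeddings from Node2Vec over a label co-occurrence graph.
# The graph source, dimensions, and walk settings below are assumptions for illustration,
# not the paper's actual configuration.
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

# Toy label co-occurrence graph: edge weight = how often two labels co-occur in images.
G = nx.Graph()
G.add_weighted_edges_from([
    ("apple", "fruit", 12), ("apple", "computer", 7),   # "apple" is polysemous
    ("computer", "keyboard", 15), ("fruit", "banana", 9),
])

# Biased random walks + skip-gram give each label a vector shaped by its graph context,
# so both senses of "apple" influence its embedding.
n2v = Node2Vec(G, dimensions=64, walk_length=20, num_walks=100, workers=2)
model = n2v.fit(window=5, min_count=1)

apple_vec = model.wv["apple"]   # word vector for a polysemous label
print(apple_vec.shape)          # (64,)
```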
Image downscaling and upscaling are two basic rescaling operations. Once an image is downscaled, it is difficult to reconstruct via upscaling because of the loss of information. To make these two processes more compatible and improve reconstruction performance, some efforts model them as a joint encoding-decoding task, with the constraint that the downscaled (i.e., encoded) low-resolution (LR) image must preserve the original visual appearance. To implement this constraint, most methods guide the downscaling module by supervising it with the bicubically downscaled LR version of the original high-resolution (HR) image. However, this bicubic LR guidance may be suboptimal for the subsequent upscaling (i.e., decoding) and restrict the final reconstruction performance. In this paper, instead of applying the LR guidance directly, we propose an additional invertible flow guidance module (FGM), which transforms the downscaled representation into a visually plausible image during downscaling and transforms it back during upscaling. Benefiting from the invertibility of FGM, the downscaled representation is freed from the LR guidance and does not disturb the downscaling-upscaling process. This allows us to remove the restrictions on the downscaling module and optimize the downscaling and upscaling modules in an end-to-end manner, so that the two modules cooperate to maximize the HR reconstruction performance. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SotA) performance on both downscaled and reconstructed images.
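The abstract relies on the FGM being exactly invertible, but its architecture is not specified there. As a minimal sketch of that invertibility property only (not the paper's actual module), the following PyTorch additive coupling layer maps a downscaled representation forward during downscaling and recovers it exactly during upscaling:

```python
# Hypothetical sketch of an invertible transform in the spirit of the flow guidance
# module (FGM). The real FGM design is not given in the abstract; this only
# illustrates how an invertible mapping can be applied and then undone exactly.
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # Small conv net predicting the shift applied to the second half of the channels.
        self.shift = nn.Sequential(
            nn.Conv2d(half, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels - half, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Downscaling direction: transform the representation.
        x1, x2 = x.chunk(2, dim=1)
        return torch.cat([x1, x2 + self.shift(x1)], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        # Upscaling direction: undo the transform exactly.
        y1, y2 = y.chunk(2, dim=1)
        return torch.cat([y1, y2 - self.shift(y1)], dim=1)

layer = AdditiveCoupling(channels=4)
z = torch.randn(1, 4, 64, 64)               # stand-in for a downscaled representation
recon = layer.inverse(layer(z))
print(torch.allclose(recon, z, atol=1e-6))  # True: the transform is invertible
```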