In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in such systems can better guide users to provide feedback and thus improve retrieval. However, developing explainable vision-language retrieval systems is challenging due to limited labeled multimodal data. In the retrieval of complex scenes, the limited-data issue becomes more severe: with multiple objects in a complex scene, each user query may not exhaustively describe all objects in the desired image, so more labeled queries are needed. Limited labeled data can cause data selection biases and lead the models to learn spurious correlations. Having learned such spurious correlations, existing explainable models may fail to accurately extract regions from images and keywords from user queries. In this paper, we find that deconfounded learning is an important step toward better vision-language explanations. We therefore propose a deconfounded explainable vision-language retrieval system. By introducing deconfounded learning to pretrain our vision-language model, spurious correlations in the model are reduced through interventions over potential confounders. This yields more accurate representations and further enables better explainability. Based on the explainable retrieval results, we propose novel interactive mechanisms, in which users can better understand why the system returns particular results and give feedback that effectively improves them. This additional feedback is sample-efficient and thus alleviates the data limitation problem. Through extensive experiments, our system achieves improvements of about 60% over the state-of-the-art.
CCS CONCEPTS: • Computing methodologies → Causal reasoning and diagnostics; • Information systems → Users and interactive retrieval; Retrieval efficiency; Presentation of retrieval results; Image search.
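The abstract above does not spell out how the interventions over potential confounders are realized. One common instantiation of deconfounded learning in the vision-language literature is backdoor adjustment over a fixed dictionary of confounder prototypes; the sketch below illustrates only that general idea, not the paper's implementation. The module name `BackdoorIntervention`, the confounder dictionary, the prior, and all dimensions are assumptions made for illustration.

```python
# Hypothetical sketch of a backdoor-adjustment ("intervention") layer, which
# approximates P(Y | do(X)) = sum_z P(Y | X, z) P(z) using a fixed dictionary
# of confounder prototypes z. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BackdoorIntervention(nn.Module):
    def __init__(self, feat_dim: int = 512, num_confounders: int = 100):
        super().__init__()
        # Confounder dictionary, e.g. clustered object/scene prototypes,
        # kept fixed during training, with a uniform prior over prototypes.
        self.confounders = nn.Parameter(torch.randn(num_confounders, feat_dim),
                                        requires_grad=False)
        self.register_buffer("prior",
                             torch.full((num_confounders,), 1.0 / num_confounders))
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)
        self.out = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, x):
        # x: (B, feat_dim) image or text features before cross-modal fusion.
        scores = self.query(x) @ self.key(self.confounders).t()
        attn = torch.softmax(scores / x.size(-1) ** 0.5, dim=-1)   # (B, K)
        # Expectation over confounders, weighted by attention and the prior.
        weights = attn * self.prior                                 # (B, K)
        z = weights @ self.confounders                              # (B, feat_dim)
        z = z / weights.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return self.out(torch.cat([x, z], dim=-1))                  # deconfounded feature
```

A layer like this would typically be applied to the unimodal features before they are fused, so that downstream attention maps and keyword extractions are computed from the intervened representations rather than from features carrying the original selection biases.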
We study the problem of composition learning for image retrieval, in which we learn to retrieve target images with search queries formed by composing a reference image with a modification text that describes desired changes to the image. Existing models of composition learning for image retrieval are generally built with large-scale datasets, demanding extensive training samples, i.e., query-target pairs, as supervision, which restricts their application to the few-shot scenario where only a few query-target pairs are available. Recently, prompt tuning with frozen pretrained language models has shown remarkable performance when the amount of training data is limited. Inspired by this, we propose a prompt tuning mechanism with the pretrained CLIP model for the task of few-shot composition learning for image retrieval. Specifically, we regard the representation of the reference image as a trainable visual prompt, prefixed to the embedding of the text sequence. One challenge is to efficiently train the visual prompt with few-shot samples. To deal with this issue, we further propose a self-supervised auxiliary task that ensures the reference image can retrieve itself when no modification information is given in the text, which facilitates training of the visual prompt while requiring no additional annotations of query-target pairs. Experiments on multiple benchmarks show that our proposed model yields superior performance when trained with only a few query-target pairs.
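To make the visual-prompt idea and the self-supervised auxiliary task concrete, here is a minimal PyTorch-style sketch under stated assumptions: the frozen encoder arguments stand in for the CLIP image and text towers (treated as black boxes that return pooled features), and the projection layer, prompt length, and loss temperature are illustrative choices, not the authors' released code.

```python
# Hypothetical sketch: the reference-image embedding is projected into the
# text token space and prefixed to the text token embeddings; only the
# projection ("visual prompt" mapper) is trained, the CLIP towers stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPromptComposer(nn.Module):
    def __init__(self, frozen_image_encoder: nn.Module,
                 frozen_text_transformer: nn.Module,
                 img_dim: int = 512, tok_dim: int = 512, n_prompt: int = 4):
        super().__init__()
        self.image_encoder = frozen_image_encoder        # frozen CLIP image tower
        self.text_transformer = frozen_text_transformer  # frozen CLIP text tower
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_transformer.parameters():
            p.requires_grad = False
        # Only this projection is trainable with the few-shot samples.
        self.prompt_proj = nn.Linear(img_dim, n_prompt * tok_dim)
        self.n_prompt, self.tok_dim = n_prompt, tok_dim

    def forward(self, ref_image, text_token_embeds):
        # ref_image: (B, C, H, W); text_token_embeds: (B, L, tok_dim)
        img_feat = self.image_encoder(ref_image)               # (B, img_dim)
        prompt = self.prompt_proj(img_feat)                    # (B, n_prompt*tok_dim)
        prompt = prompt.view(-1, self.n_prompt, self.tok_dim)  # (B, n_prompt, tok_dim)
        seq = torch.cat([prompt, text_token_embeds], dim=1)    # prefix the visual prompt
        # Assumed: the frozen text tower returns a pooled query feature.
        query_feat = self.text_transformer(seq)                # (B, tok_dim)
        return F.normalize(query_feat, dim=-1)

def self_retrieval_loss(model, ref_image, empty_text_embeds, temperature=0.07):
    # Auxiliary task: with no modification text, the composed query should
    # retrieve the reference image itself, so no extra query-target labels
    # are needed to supervise the visual prompt.
    query = model(ref_image, empty_text_embeds)                     # (B, d)
    target = F.normalize(model.image_encoder(ref_image), dim=-1)    # (B, d)
    logits = query @ target.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

In this reading, the auxiliary contrastive loss is simply added to the retrieval loss on the few available query-target pairs, so the extra supervision comes for free from unlabeled reference images.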