In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in these systems can better guide users in providing feedback and thus improve retrieval. However, developing explainable vision-language retrieval systems is challenging because labeled multimodal data is limited. The problem is more severe when retrieving complex scenes: because such scenes contain multiple objects, a single user query may not exhaustively describe every object in the desired image, so more labeled queries are needed. Limited labeled data can introduce data selection biases and lead models to learn spurious correlations. When they learn spurious correlations, existing explainable models may fail to accurately extract regions from images and keywords from user queries. In this paper, we find that deconfounded learning is an important step toward better vision-language explanations, and we propose a deconfounded explainable vision-language retrieval system. By pretraining our vision-language model with deconfounded learning, we reduce the spurious correlations in the model through interventions by potential confounders. This helps to train more accurate representations and, in turn, enables better explainability. Based on the explainable retrieval results, we propose novel interactive mechanisms through which users can better understand why the system returns particular results and give feedback that effectively improves them. This additional feedback is sample-efficient and thus alleviates the data limitation problem. In extensive experiments, our system achieves an improvement of about 60% over the state of the art.
CCS CONCEPTS
• Computing methodologies → Causal reasoning and diagnostics; • Information systems → Users and interactive retrieval; Retrieval efficiency; Presentation of retrieval results; Image search.