“…Following its popularity in natural language processing [8,39,43,49,60], attention modeling has been rapidly adopted in various computer vision tasks, such as image recognition [14,23,58,66,73], domain adaptation [67,83], human pose estimation [9,63,77], object detection [4], and image generation [76,81,86]. Further, co-attention mechanisms have become an essential tool in many vision-language applications and sequential modeling tasks, such as visual question answering [41,44,75,78], visual dialog [74,84], vision-language navigation [68], and video segmentation [42,61], showing their effectiveness in capturing the underlying relations between different entities. Inspired by the general idea of attention mechanisms, this work leverages co-attention to mine semantic relations within training image pairs, which helps the classifier network learn complete object patterns and generate precise object localization maps.…”