“…Differentiable attention mechanisms enable a neural network to focus more on relevant elements of the input than on irrelevant parts. With their popularity in the field of natural language processing [8,39,43,49,60], attention modeling is rapidly adopted in various computer vision tasks, such as image recognition [14,23,58,66,73], domain adaptation [67,83], human pose estimation [9,63,77], object detection [4] and image generation [76,81,86]. Further, co-attention mechanisms become an essential tool in many vision-language applications and sequential modeling tasks, such as visual question answering [41,44,75,78], visual dialog [74,84], vision-language navigation [68], and video segmentation [42,61], showing its effectiveness in capturing the underlying relations between different entities.…”