“…Multi-modal grounding tasks (e.g., phrase localization [1,3,9,41,49], referring expression comprehension [17,19,24,26,29,30,37,51,52,55,56] and segmentation [6,18,20,21,29,38,53,56]) aim to generalize traditional object detection and segmentation to localization of regions (rectangular or at a pixel level) in images that correspond to free-form linguistic expressions. These tasks have emerged as core problems in vision and ML due to the breadth of applications that can make use of such techniques, spanning image captioning, visual question answering, visual reasoning and others.…”