“…For instance, Kiran et al. (2021) use optical flow, Subramaniam et al. (2019) use co-segmentation, and Bhuiyan et al. (2020) use pose-guided contextual information. Unlike Kiran et al. (2021), Bhuiyan et al. (2020), and Subramaniam et al. (2019), we introduce the use of cross-modal contextual information, i.e., the contextual information from one modality is processed to gate the backbone architecture of another modality. Following the common trend in Kiran et al. (2021), Bhuiyan et al. (2020), and Subramaniam et al. (2019), we rely on a simple gated attention mechanism, which allows for multiplicative interaction between the input features from one modality and the attention map from another modality.…”
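
The multiplicative gating described in the excerpt can be illustrated with a minimal NumPy sketch. The excerpt does not specify how the attention map is computed, so everything concrete here is an assumption made for illustration: the function name `cross_modal_gate`, the 1x1 channel projection (`weight`, `bias`), the sigmoid squashing, and the tensor shapes are all hypothetical and are not taken from the cited work.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping logits into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def cross_modal_gate(features_a, context_b, weight, bias=0.0):
    """Gate modality-A features with an attention map derived from modality B.

    features_a: array of shape (C_a, H, W), backbone features of modality A.
    context_b:  array of shape (C_b, H, W), contextual features of modality B.
    weight:     array of shape (C_b,), a hypothetical 1x1 projection that
                reduces the C_b context channels to a single-channel map.
    """
    # Assumed step: project the context channels to one logit per location.
    logits = np.tensordot(weight, context_b, axes=([0], [0])) + bias  # (H, W)
    # Assumed step: squash the logits into a spatial attention map.
    attention = sigmoid(logits)  # (H, W), values in (0, 1)
    # Multiplicative interaction: the map from modality B scales every
    # channel of modality A at each spatial location (broadcast over C_a).
    gated = features_a * attention[None, :, :]
    return gated, attention


rng = np.random.default_rng(0)
features_a = rng.standard_normal((8, 4, 4))  # toy features, modality A
context_b = rng.standard_normal((3, 4, 4))   # toy context, modality B
weight = rng.standard_normal(3)              # toy projection weights

gated, attention = cross_modal_gate(features_a, context_b, weight)
print(gated.shape, attention.shape)
```

Under these assumptions the gate only rescales the features of modality A, so the output keeps the shape of `features_a`, and locations where the attention value is near zero are suppressed.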