“…As the number of videos from Internet Protocol (IP) cameras used for video surveillance increases, video understanding tasks such as action recognition [3,16,21] and action localization [14,36] become crucial. Moreover, videos with textual descriptions (e.g., titles, captions, or keywords) have encouraged research on multi-modal problems such as video captioning [18,25] and temporal video grounding [4,6,22,35].…”