Temporal moment localization via language aims to localize the span in an untrimmed video that best matches a given natural language query. Most previous works match the whole query feature against multiple moment proposals, or match a global video embedding against phrase- or word-level query features. However, these coarse interaction models become insufficient when the relationship between the query and the video is more complex. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and the video are encoded into multiple feature embeddings over several semantic sub-spaces. Each phrase embedding is then applied as a filter over the video features to generate an attention sequence, which is used to reweight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which prevents our model from falling into a local optimum. To validate the proposed method, we conduct extensive experiments on two popular benchmark datasets, Charades-STA and TACoS. Our model surpasses state-of-the-art methods, which demonstrates its effectiveness.
CCS CONCEPTS
• Information systems → Novelty in information retrieval.
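To make the phrase-guided reweighting described in the abstract concrete, the following is a minimal PyTorch sketch of one interaction branch: a phrase embedding scores each video clip, the scores are normalized into an attention sequence over time, and the attention reweights the clip features. The function name, the dot-product scoring, and the tensor shapes are illustrative assumptions; the abstract does not specify the exact formulation.

```python
import torch
import torch.nn.functional as F

def phrase_guided_reweighting(video_feats: torch.Tensor, phrase_emb: torch.Tensor) -> torch.Tensor:
    """Reweight clip-level video features with a phrase-conditioned attention sequence.

    video_feats: (T, d) clip features for one semantic sub-space (branch).
    phrase_emb:  (d,)   phrase embedding projected into the same sub-space.
    Returns reweighted video features of shape (T, d).
    """
    # One similarity score per time step between the phrase and each clip
    # (dot-product scoring is an assumption for illustration).
    scores = video_feats @ phrase_emb          # (T,)
    attn = F.softmax(scores, dim=0)            # attention sequence over time
    # Scale each clip feature by its attention weight.
    return attn.unsqueeze(-1) * video_feats    # (T, d)

# Toy usage with random features: 32 clips, 256-dim embeddings.
video_feats = torch.randn(32, 256)
phrase_emb = torch.randn(256)
reweighted = phrase_guided_reweighting(video_feats, phrase_emb)
```

In a multi-branch setting, a sketch like this would be applied once per semantic sub-space, and the reweighted features from all branches would then be passed to the boundary decoder.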