We investigate weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available for training. The task aims to localize the objects described in the query sentence to visual regions in the video. Despite recent progress, existing approaches have not fully exploited the description sentences for cross-modal alignment in two respects: (1) Most of them extract objects from the description sentences and represent them with fixed textual representations; while achieving promising results, they do not make full use of the contextual information in the sentence. (2) A few works attempt to learn object representations from contextual information, but suffer a significant performance drop due to unstable training in cross-modal alignment. To address these issues, we propose a Stable Context Learning (SCL) framework for WSVOG that jointly enjoys the merits of stable learning and rich contextual information. Specifically, we design two modules, a Context-Aware Object Stabilizer module and a Cross-Modal Alignment Knowledge Transfer module, which cooperate to inject contextual information into stable object concepts in the text modality and to transfer contextualized knowledge during cross-modal alignment. The whole framework is optimized under a frame-level multiple instance learning (MIL) paradigm. Extensive experiments on three popular benchmarks demonstrate the effectiveness of our approach.
CCS CONCEPTS
• Information systems → Multimedia and multimodal retrieval.
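To make the frame-level MIL formulation mentioned in the abstract concrete, the sketch below shows one common way such a paradigm can be instantiated: region-object similarities are computed per frame, max-pooled over regions, and aggregated into a video-sentence score that can be trained with only video-sentence pairs. The module name, feature dimensions, and pooling choices here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameLevelMILGrounding(nn.Module):
    """Minimal, hypothetical sketch of frame-level MIL cross-modal alignment."""

    def __init__(self, region_dim=1024, text_dim=300, joint_dim=256):
        super().__init__()
        # Project visual regions and textual object queries into a joint space.
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, regions, objects):
        # regions: (T, R, region_dim) - R region proposals in each of T frames
        # objects: (K, text_dim)      - K object representations from the query sentence
        r = F.normalize(self.region_proj(regions), dim=-1)   # (T, R, D)
        o = F.normalize(self.text_proj(objects), dim=-1)     # (K, D)
        sim = torch.einsum('trd,kd->tkr', r, o)              # region-object similarities (T, K, R)
        frame_scores = sim.max(dim=-1).values                # best-matching region per object per frame (T, K)
        video_score = frame_scores.mean(dim=0).mean()        # aggregate to a video-sentence score
        return video_score, sim

# Usage: scores for matched vs. mismatched video-sentence pairs can be trained
# with a ranking or contrastive loss under the MIL assumption; at test time,
# the argmax over `sim` gives the grounded region for each queried object.
model = FrameLevelMILGrounding()
regions = torch.randn(8, 20, 1024)   # 8 frames, 20 region proposals each
objects = torch.randn(3, 300)        # 3 object words from the query sentence
score, sim = model(regions, objects)
```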