Automatically describing video content in natural language has attracted much attention in the computer vision (CV) and natural language processing (NLP) communities. Most existing methods predict one word at a time, feeding only the last generated word back as input at the next time step, so the other generated words are not fully exploited. Furthermore, traditional methods optimize the model on all the training samples in every epoch without considering how well each sample has already been learned, which leads to a lot of unnecessary training and fails to target the difficult samples. To address these issues, we propose a text-based dynamic attention model named TDAM, which imposes a dynamic attention mechanism on all the generated words in order to improve the contextual semantic information and enhance overall control of the whole sentence. Moreover, the text-based dynamic attention mechanism and the visual attention mechanism are linked together to focus on the important words, and they benefit from each other during training. Accordingly, the model is trained in two steps: "starting from scratch" and "checking for gaps". The former uses all the samples to optimize the model, while the latter trains only on samples with poor control. Experimental results on the popular datasets MSVD and MSR-VTT demonstrate that our non-ensemble model outperforms state-of-the-art video captioning methods.
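The two-step training schedule described above can be illustrated with a minimal sketch. The abstract does not say how "poor control" is measured, so this sketch assumes a per-sample loss compared against a fixed threshold. The names `two_stage_schedule`, `sample_loss`, and `threshold` are hypothetical and do not come from the TDAM paper.

```python
from typing import Callable, Iterator, List


def two_stage_schedule(
    num_samples: int,
    sample_loss: Callable[[int], float],
    full_epochs: int,
    gap_epochs: int,
    threshold: float,
) -> Iterator[List[int]]:
    """Yield, for each epoch, the indices of the samples to train on.

    Assumption: a sample counts as poorly learned when its current
    loss exceeds `threshold`. The paper's actual criterion may differ.
    """
    all_indices = list(range(num_samples))

    # Step 1 ("starting from scratch"): every epoch uses every sample.
    for _ in range(full_epochs):
        yield list(all_indices)

    # Step 2 ("checking for gaps"): re-check the losses before each epoch
    # and keep only the samples that are still poorly fitted.
    for _ in range(gap_epochs):
        hard = [i for i in all_indices if sample_loss(i) > threshold]
        if not hard:
            break  # nothing left to revisit
        yield hard


if __name__ == "__main__":
    # Toy per-sample losses standing in for a real model evaluation.
    losses = [0.1, 2.0, 0.3, 1.5]
    for epoch, indices in enumerate(
        two_stage_schedule(4, lambda i: losses[i], 2, 1, 1.0)
    ):
        print(epoch, indices)
```

In a real training loop, `sample_loss` would be replaced by an evaluation of the current model on each training clip, so the set of hard samples shrinks as training progresses.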
The attention mechanism and the sequence-to-sequence framework have shown promising advances in the temporal task of video captioning. However, imposing the attention mechanism on non-visual words, such as "of" and "the", may mislead the decoder and decrease the overall performance of video captioning. Furthermore, the traditional sequence-to-sequence framework optimizes the model with a word-level cross-entropy loss, which results in an exposure bias problem. This problem occurs because, at test time, the model uses its own previously generated words to predict the next word, whereas during training it maximizes the likelihood of the next ground-truth word given the true previous word. To address these issues, we propose the reinforced adaptive attention model (RAAM), which integrates an adaptive attention mechanism with long short-term memory to flexibly utilize visual signals and language information as needed. Accordingly, the model is trained with both a word-level loss and a sentence-level loss, taking advantage of both and alleviating the exposure bias problem by directly optimizing the sentence-level metric with a reinforcement learning algorithm. In addition, a novel training method is proposed for mixed loss optimization. Experiments on the Microsoft Video Description benchmark corpus (MSVD) and the challenging MPII-MD Movie Description dataset demonstrate that the proposed RAAM method, which uses only a single feature, achieves competitive or even superior results compared to existing state-of-the-art models for video captioning.
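The combination of a word-level loss and a sentence-level loss can be sketched as follows. The abstract does not name the reinforcement learning algorithm or the mixing scheme, so this sketch assumes a REINFORCE-style surrogate with a baseline and a fixed mixing weight. The function names and the `weight` parameter are hypothetical and are not taken from the RAAM paper.

```python
import math
from typing import Sequence


def word_level_loss(gt_probs: Sequence[float]) -> float:
    """Cross-entropy: negative log-likelihood of the ground-truth words.

    `gt_probs[t]` is the model probability of the ground-truth word at
    step t, given the true previous words (teacher forcing).
    """
    return -sum(math.log(p) for p in gt_probs)


def sentence_level_loss(
    sampled_probs: Sequence[float], reward: float, baseline: float
) -> float:
    """REINFORCE-style surrogate loss for one sampled caption (assumed).

    `sampled_probs[t]` is the probability of the word the model itself
    sampled at step t. `reward` is a sentence-level score of the sampled
    caption, and `baseline` reduces the variance of the estimate.
    """
    log_prob = sum(math.log(p) for p in sampled_probs)
    return -(reward - baseline) * log_prob


def mixed_loss(
    gt_probs: Sequence[float],
    sampled_probs: Sequence[float],
    reward: float,
    baseline: float,
    weight: float,
) -> float:
    """Convex combination of the two losses; `weight` is hypothetical."""
    return weight * word_level_loss(gt_probs) + (1.0 - weight) * sentence_level_loss(
        sampled_probs, reward, baseline
    )


if __name__ == "__main__":
    print(mixed_loss([0.5, 0.5], [0.25], reward=0.8, baseline=0.5, weight=0.7))
```

Because the sentence-level term is computed on words the model sampled itself, it exposes the model to its own predictions during training, which is how this kind of loss is meant to reduce the train-test mismatch described above.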