Video captioning is the task of encoding video features and decoding them into natural language; it is a blend of computer vision and natural language processing techniques. Video captioning is a challenging task because appropriate captions can only be generated by modelling temporal features alongside spatial ones. The default framework for this task is the encoder-decoder. Ensuring the optimality of the generated caption for a particular video is essential, yet most existing works do not address it. To ensure caption optimality, our proposed work introduces a confiner into the default framework. The confiner is designed to reduce the semantic gap between the generated caption and the video: it inspects the visual content implied by the generated caption against the actual video. The confiner is realized in two separate model variants, one using an LSTM and one using a GRU. The proposed models are evaluated on three benchmark datasets, MSVD, MSR-VTT, and M-VAD, and performance is measured with the BLEU, METEOR, and CIDEr evaluation metrics.
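The encoder-decoder framework described above can be sketched as follows. This is a minimal illustrative skeleton in PyTorch, not the authors' implementation: the class name `VideoCaptioner`, the dimensions, and the cosine-similarity confiner score are all assumptions made for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoCaptioner(nn.Module):
    """Sketch of an encoder-decoder video captioner with a confiner score."""

    def __init__(self, feat_dim=512, hidden=256, vocab=1000):
        super().__init__()
        # Encoder: LSTM over per-frame visual features (temporal modelling).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Decoder: GRU that emits word logits conditioned on the encoder state
        # (the paper builds LSTM- and GRU-based variants separately).
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, frames, feat_dim); captions: (batch, length)
        _, (h, _) = self.encoder(frame_feats)          # final hidden state
        dec_out, _ = self.decoder(self.embed(captions), h)
        return self.out(dec_out)                        # (batch, length, vocab)

    @staticmethod
    def confiner_score(video_emb, caption_emb):
        # Confiner (hypothetical sketch): cosine similarity between video and
        # caption embeddings as a proxy for the caption-video semantic gap.
        return F.cosine_similarity(video_emb, caption_emb, dim=-1)

model = VideoCaptioner()
feats = torch.randn(2, 8, 512)          # 2 clips, 8 frames of CNN features
caps = torch.randint(0, 1000, (2, 5))   # 2 caption prefixes, 5 tokens each
logits = model(feats, caps)
print(logits.shape)  # torch.Size([2, 5, 1000])
```

In this sketch the decoder produces per-step word logits for teacher-forced training; the confiner score would then be used to check a decoded caption against the video before accepting it.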