Image Caption using VGG model and LSTM

Image captioning aims to generate meaningful verbal descriptions of a digital image. The field is growing rapidly thanks to the enormous increase in available computational resources. The most advanced methods are, however, resource-demanding. In this paper, we return to the encoder–decoder deep-learning model and investigate how replacing its components with newer equivalents improves overall effectiveness. Our primary motivation is to obtain the greatest possible improvement of classic methods, which remain applicable in resource-constrained environments where the most advanced models are too heavy to be applied efficiently. We investigate image feature extractors, recurrent neural networks, word embedding models, and word generation layers, and discuss how each component influences the captioning model’s overall performance. Our experiments are performed on the MS COCO 2014 dataset. The results show that replacing components improves the quality of the generated captions, and they will help in designing efficient models with optimal combinations of components.
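The component-replacement study described in the abstract can be pictured as a grid over four interchangeable slots. The sketch below is only an illustration of that idea, not the authors' code: the class `CaptionerConfig`, the function `enumerate_configs`, and every candidate name in `SEARCH_SPACE` are assumptions made for this example (only VGG and LSTM appear in the page title), and the abstract does not list the components actually compared.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CaptionerConfig:
    """One combination of the four component slots named in the abstract."""
    feature_extractor: str  # image encoder
    rnn: str                # recurrent decoder cell
    word_embedding: str     # word embedding model
    word_generator: str     # word generation (output) layer


# Hypothetical candidates per slot, chosen for illustration only.
SEARCH_SPACE = {
    "feature_extractor": ["VGG16", "ResNet50"],
    "rnn": ["LSTM", "GRU"],
    "word_embedding": ["GloVe", "fastText"],
    "word_generator": ["dense_softmax", "merge_then_softmax"],
}


def enumerate_configs(space):
    """Return every combination of one candidate per slot."""
    keys = list(space)
    combos = product(*(space[key] for key in keys))
    return [CaptionerConfig(**dict(zip(keys, combo))) for combo in combos]


configs = enumerate_configs(SEARCH_SPACE)
baseline = configs[0]  # first candidate in every slot
print(len(configs), "candidate models; baseline:", baseline)
```

Each configuration would then be trained and scored on the same dataset, so that differences in caption quality can be attributed to the slot that changed.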



The Optimal Choice of the Encoder–Decoder Model Components for Image Captioning
