2022
DOI: 10.22219/kinetik.v7i4.1568
Image Captioning using Hybrid of VGG16 and Bidirectional LSTM Model

Abstract: Image captioning is one of the biggest challenges at the intersection of computer vision and natural language processing. Although many studies have addressed image captioning, their evaluation scores remain low, so this study focuses on improving on those results. We use the Flickr8k dataset and the VGG16 Convolutional Neural Network (CNN) as an encoder to extract image features. A Recurrent Neural Network (RNN…
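The abstract describes a merge-style encoder-decoder: VGG16 image features on one branch, a bidirectional LSTM over the partial caption on the other, combined to predict the next word. A minimal sketch of that wiring in Keras, with toy vocabulary and length values (all dimensions here are hypothetical, not the paper's settings):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB = 1000   # toy vocabulary size (hypothetical)
MAX_LEN = 20   # toy maximum caption length (hypothetical)
FEAT = 4096    # size of VGG16's fc2 feature vector

# Image branch: pre-extracted VGG16 features -> dense projection
img_in = layers.Input(shape=(FEAT,))
img_vec = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(img_in))

# Text branch: partial caption -> embedding -> bidirectional LSTM
txt_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB, 256, mask_zero=True)(txt_in)
txt_vec = layers.Bidirectional(layers.LSTM(128))(emb)  # 2 x 128 = 256 dims

# Merge both branches and score the next word over the vocabulary
merged = layers.add([img_vec, txt_vec])
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(VOCAB, activation="softmax")(hidden)

model = Model(inputs=[img_in, txt_in], outputs=out)
```

At inference time a caption is generated word by word: the model is called repeatedly with the image features and the caption generated so far, appending the highest-scoring word each step.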

Cited by 4 publications (1 citation statement)
References 19 publications
“…Several approaches for image captioning have been made from deep learning encoder-decoder based models with CNN to extract the spatial and visual features and RNN to generate them in sequence [10,11]. A spectrum of encoding models has been explored to enhance image captioning systems, encompassing diverse architectures such as Inception-v3, Visual Geometry Group Network (VGGNet), Inception-v3 augmented with LSTM as a decoder [12], Residual Network 152 layer (ResNet-152) [13], and VGG-16 [14]. Notably, employing transfer learning through pre-trained encoders, commonly derived from ImageNet, has demonstrated superior outcomes [15].…”
Section: Related Work
confidence: 99%
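The citation statement above notes that transfer learning with encoders pre-trained on ImageNet yields superior results. A minimal sketch of using Keras's VGG16 as such a feature extractor, truncated at the 4096-dimensional `fc2` layer (in practice one passes `weights="imagenet"`; `weights=None` is used here only so the sketch needs no weight download):

```python
import numpy as np
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# weights="imagenet" in real use; weights=None keeps this sketch download-free
base = VGG16(weights=None)
encoder = Model(base.input, base.get_layer("fc2").output)  # 4096-d features

# One dummy 224x224 RGB image, preprocessed the way VGG16 expects
img = (np.random.rand(1, 224, 224, 3) * 255).astype("float32")
features = encoder.predict(preprocess_input(img), verbose=0)
print(features.shape)  # (1, 4096)
```

These fixed-length feature vectors are what the caption decoder consumes, so the expensive CNN forward pass can be run once per image and cached.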