Image Captioning with Attention for Smart Local Tourism using EfficientNet

Fudholi, Dhomas Hatta; Windiatmoko, Yurio; Afrianto, Nurdi; Susanto, Prastyo Eko; Suyuti, Magfirah; Hidayatullah, Ahmad Fathan; Rahmadi, Ridho

doi:10.1088/1757-899x/1077/1/012038

Cited by 7 publications

(4 citation statements)

References 13 publications

“…To produce the next word, adaptive attention [34] was used to determine when and at which part of the image should be focused on by using translated Microsoft Common Objects in Context (MS COCO) and Flickr30k datasets. Research [35] applied visual attention mechanism to their model to produce a caption for the image that makes greater sense. As the result, their model was able to give a sensible and detailed caption in the local tourism domain.…”

Section: Attention Mechanism (mentioning)

confidence: 99%

A study on attention-based deep learning architecture model for image captioning

Fudholi, Al-Faruq, Nayoan et al. 2024. IJ-AI

Image captioning has been widely studied because of its role in visual scene understanding. Automatic visual scene understanding is useful for remote monitoring systems and for visually impaired people. Attention-based models, including transformers, are the current state-of-the-art architectures for image captioning. This study reviews work on image captioning models, especially those built on attention mechanisms. The collected works are analyzed by architecture, dataset, and evaluation metrics, and a general flow of image captioning model development is presented. The literature search was carried out on Google Scholar. The study covers 36 works, including work on image captioning in Indonesian, which gives one view of image captioning development in a low-resource language. Studies using transformer models generally achieve higher evaluation metric scores. Among the reviewed works, the highest scores on the consensus-based image description evaluation (CIDEr) c5 and c40 metrics are 138.5 and 140.5, respectively. This study provides a baseline for future development of image captioning models and outlines the general development process, including a picture of development in a low-resource language.

“…Study with a specific domain, requires a specially made dataset because it has not been available before. Research [35] collected a total of 1,696 local tourism-related images from Google search engines.…”

Section: Indonesian Datasets (mentioning)

confidence: 99%

A study on attention-based deep learning architecture model for image captioning

Fudholi, Al-Faruq, Nayoan et al. 2024. IJ-AI

“…However, during its development, the existing modeling was also trained on other datasets for more specific captioning tasks. The study [56] uses a dataset of images related to local tourism in Yogyakarta gathered from the Google search engine. This research aims to create a unique image captioning model for Yogyakarta tourism that can be expanded into a chatbot system.…”

Section: Introduction (mentioning)

confidence: 99%

Image captioning to aid blind and visually impaired outdoor navigation

Faurina, Jelita, Vatresia et al. 2023. IJ-AI

Artificial intelligence has dramatically improved services that meet human needs, including technology that helps blind and visually impaired people understand visual scenes and navigate in daily life. This study developed an image captioning model to aid the blind and visually impaired in outdoor navigation. The model uses an encoder-decoder design, with convolutional neural network (CNN) feature extraction and an attention layer as the encoder and a long short-term memory (LSTM) network as the decoder. ResNet101 and ResNet152 are used in the encoder to extract image features. The extracted features and the caption are passed to the attention layer and the LSTM network. The attention layer uses the Bahdanau attention mechanism. Model performance is measured with the bilingual evaluation understudy (BLEU) score, the metric for evaluation of translation with explicit ordering (METEOR), and recall-oriented understudy for gisting evaluation-longest common subsequence (ROUGE-L). ResNet101 performed best, scoring 91.811% on BLEU-4 and 94.0337% on METEOR. The captioning results show that the model is quite successful at producing a simple caption suited to each image.
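As a rough illustration of the Bahdanau (additive) attention step named in this abstract, the sketch below scores each image region against the decoder state as v · tanh(W_f f_i + W_h h), normalizes the scores with a softmax, and forms a weighted context vector. It is a minimal pure-Python sketch, not the cited authors' implementation; the function name `additive_attention`, the parameter names, and the toy weights are all assumptions made for illustration.

```python
import math


def additive_attention(features, hidden, W_f, W_h, v):
    """Additive attention over image regions (illustrative sketch).

    features: list of region feature vectors (each of length D)
    hidden:   decoder hidden state (length H)
    W_f:      A x D projection for features (hypothetical weights)
    W_h:      A x H projection for the hidden state (hypothetical weights)
    v:        length-A scoring vector (hypothetical weights)
    Returns (weights, context).
    """
    def matvec(matrix, vec):
        return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

    proj_h = matvec(W_h, hidden)

    # One unnormalized score per region: v . tanh(W_f f_i + W_h h)
    scores = []
    for f in features:
        proj_f = matvec(W_f, f)
        scores.append(
            sum(vi * math.tanh(a + b) for vi, a, b in zip(v, proj_f, proj_h))
        )

    # Numerically stable softmax over the region scores
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the region features
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context


# Toy example with two regions and made-up weights
features = [[1.0, 0.0], [0.0, 1.0]]
hidden = [0.5]
W_f = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.0], [0.0]]
v = [1.0, 0.0]
weights, context = additive_attention(features, hidden, W_f, W_h, v)
```

In a captioning decoder, the context vector would typically be combined with the previous word embedding and passed to the LSTM at each step; that wiring is omitted here.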

“…We can find Image Captioning research in the Indonesian language using the existing dataset in research [12]- [15]. Meanwhile, Dhomas Hatta Fudholi conducted research for more specific applications, namely for local tourism image captioning [16] and household environment visual understanding [17], [18]. These studies were evaluated with several methods and showed that their evaluation scores were still low.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring Pre-Trained Model and Language Model for Translating Image to Bahasa

Nurhopipah, Suhaman, Widianto 2023. Indonesian J. Comput. Cybern. Syst.

In the last decade, there have been significant developments in image caption generation research that translates images into English descriptions. The task has also been carried out for non-English languages, including Bahasa. However, references for this line of work are still limited, so there is wide room for exploration. This paper presents a comparative study of several state-of-the-art deep learning algorithms for extracting image features and generating descriptions in Bahasa. We extracted image features using three pre-trained models: InceptionV3, Xception, and EfficientNetV2S. For the language model, we examined four architectures: LSTM, GRU, Bidirectional LSTM, and Bidirectional GRU. The dataset was Flickr8k translated into Bahasa. Models were evaluated using BLEU and METEOR. Among the pre-trained models, EfficientNetV2S gave significantly the highest score. Among the language models, there was only a slight difference in performance, although the Bidirectional GRU generally scored higher. We also found that step size in training affected overfitting: larger step sizes tended to generalize better. The best model used EfficientNetV2S and Bidirectional GRU with step size=4096, giving average scores of BLEU-1=0.5828 and METEOR=0.4520.
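To make the BLEU-1 figure reported in this abstract more concrete, the sketch below computes a simplified sentence-level BLEU-1: clipped unigram precision multiplied by a brevity penalty. It assumes whitespace tokenization and a single reference caption, which is a simplification; the cited study's exact evaluation setup is not described here, and the function name `bleu1` is hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Simplified BLEU-1 for one candidate and one reference caption."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    cand_counts = Counter(cand)
    ref_counts = Counter(ref)

    # Clip each candidate word count by its count in the reference
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty: penalize candidates shorter than the reference
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * precision


exact = bleu1("a b c", "a b c")
repeated = bleu1("the the the", "the cat")
disjoint = bleu1("x y", "a b")
```

Clipping is what keeps a caption that repeats one correct word from scoring well: in the second call only one of the three repeated words is credited.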
