2022
DOI: 10.1109/access.2022.3161428

Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks

Abstract: This paper focuses on visual attention, a state-of-the-art approach for image captioning tasks within the computer vision research area. We study the impact that different hyperparameter configurations have on an encoder-decoder visual attention architecture in terms of efficiency. Results show that the correct selection of both the cost function and the gradient-based optimizer can significantly impact the captioning results. Our system considers the cross-entropy, Kullback-Leibler divergence, mean squared error, …

Cited by 32 publications (12 citation statements)
References 25 publications
“…CNN compared the target image against a huge dataset of training images, after producing a precise explanation with the help of trained captions. The research scholars in the study conducted earlier [16] aimed at visual attention for which they proposed an advanced technique for image captioning in computer vision research zone. The researchers understood the influence exerted by distinct hyper-parameters over encoder-decoder visual attention structure with regards to efficiency.…”
Section: Literature Review (mentioning)
confidence: 99%
“…First, the use of depthwise convolution. We only introduce additional s²C parameters and O(s²CT) FLOPs as compared to the linear projection, which is negligible as compared to the total number of parameters and FLOPs in the models. Second, the process of matric sharing S. With this improvement, the number of parameters of key and value are reduced by half.…”
Section: B Convolutional Parameters Sharing Multi-head Attention (Cpsa) (mentioning)
confidence: 99%
“…Transformers [1], [2] have become a de-facto standard in deep learning and have been widely adopted in various fields. These models have been widely adopted in modern deep learning, such as natural language processing (NLP) [3], [4], [5], computer vision (CV) [6], [7], [8], [9], and speech processing [10], [11], [12], due to their ability to model long-range dependencies.…”
Section: Introduction (mentioning)
confidence: 99%
“…Currently, computer vision (CV) tasks are useful for solving problems related to object detection, classification, object counting, visual surveillance, etc., taking advantage of video resources from public surveillance cameras located in many public areas (i.e., shopping malls, supermarkets, airports, train stations, stadiums, etc.) [9][10][11][12]. The problem of the correct/incorrect wearing of face masks implies two CV tasks: (1) object detection and (2) object classification.…”
Section: Introduction (mentioning)
confidence: 99%
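
The depthwise-convolution citation statement above claims that the extra cost over a linear projection is negligible. A minimal arithmetic sketch of that claim follows. It assumes the expression in the statement reads as s²C extra parameters and O(s²CT) extra FLOPs, with s the kernel size, C the channel count, and T the number of tokens. The function names and the example values (s = 3, C = 384, T = 196) are hypothetical and do not come from the citing paper.

```python
def linear_projection_params(C: int) -> int:
    """Weights of a bias-free C -> C linear projection."""
    return C * C


def depthwise_conv_params(s: int, C: int) -> int:
    """Weights of a bias-free depthwise s x s convolution:
    one s x s kernel per channel, so s^2 * C weights in total."""
    return s * s * C


def depthwise_conv_macs(s: int, C: int, T: int) -> int:
    """Multiply-accumulates when the depthwise kernel is applied
    at each of T token positions: s^2 * C * T."""
    return s * s * C * T


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    s, C, T = 3, 384, 196
    extra = depthwise_conv_params(s, C)
    base = linear_projection_params(C)
    print(f"extra depthwise params: {extra}")
    print(f"linear projection params: {base}")
    print(f"extra / base: {extra / base:.4f}")
    print(f"extra multiply-accumulates: {depthwise_conv_macs(s, C, T)}")
```

Under these assumed sizes the depthwise kernel adds only a few percent to the weights of a single projection, which is consistent with the statement's "negligible" wording, though the real ratio depends on the model's actual dimensions.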