2021
DOI: 10.3390/app11188354
An Attentive Fourier-Augmented Image-Captioning Transformer

Abstract: Many vision–language models that output natural language, such as image-captioning models, typically use image features merely to ground the captions; most of the model's good performance can be attributed to the language model, which does the heavy lifting. This phenomenon has persisted even as transformer-based architectures became the preferred base architecture of recent state-of-the-art vision–language models. In this paper, we make the images matter more by using fast Fouri…

Cited by 5 publications (3 citation statements) · References 37 publications
“…The encoder–decoder paradigm has been proposed, wherein a global-enhanced encoder first encodes the original inputs into highly abstract localised representations and then extracts the intra- and inter-layer global representations. The decoder then applies the proposed global adaptive controller to iteratively incorporate the multimodal information while producing the caption word by word [52]. An Attentive Fourier-Augmented Image Captioning Transformer (AFCT)-based methodology has been proposed by the researchers.…”
Section: Analysis Using State-of-the-art Methods
confidence: 99%
“…It depends on the attention mechanism, specifically self-attention. Osolo et al [80] applied fast Fourier transforms to decompose the input features and extract more important information from the images to provide succinct and informative captions. To distinguish between the word semantics and grammatical structures of captions and include the PoS guiding information in the modeling, Wang et al [81] proposed a novel part-of-speech guided transformer (PoS-Transformer).…”
Section: Transformer
confidence: 99%
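The citation above describes applying fast Fourier transforms to decompose input image features so that the captions draw on more informative signals. As a minimal illustrative sketch (not the paper's exact design — the function name, the choice of magnitude spectrum, and the concatenation scheme are assumptions), one way to "Fourier-augment" patch features is to append each feature vector's frequency-domain view to its spatial view:

```python
import numpy as np

def fourier_augment(features):
    """Augment patch features with their Fourier spectrum.

    `features` is a (num_patches, dim) array of image-region features.
    We take the FFT along the feature axis and concatenate the magnitude
    spectrum to the original features, so downstream attention layers see
    both a spatial and a frequency view of each region.
    """
    spectrum = np.fft.fft(features, axis=-1)
    return np.concatenate([features, np.abs(spectrum)], axis=-1)

# Example: 49 patch features of dimension 512 (e.g., a 7x7 feature grid).
patches = np.random.rand(49, 512).astype(np.float32)
augmented = fourier_augment(patches)
print(augmented.shape)  # (49, 1024)
```

The doubled feature dimension would then feed the transformer encoder in place of the raw patch features.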
“…As a result, deep learning models have been getting more complicated and bigger, and much of the state-of-the-art performance of models such as GPT and BERT can be attributed to the amount of data used to train them. Some methods using simpler architectures, such as Fourier transforms and fully connected layers [7,33], have been proposed to reverse this trend and achieve performance close to the state of the art.…”
Section: Image Captioning
confidence: 99%
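The "Fourier transforms and fully connected layers" approach cited above matches the FNet recipe, in which the learned self-attention sublayer is replaced by a parameter-free Fourier mixing step. A minimal sketch of that mixing step, assuming the standard FNet formulation (details of the cited works [7,33] may differ):

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing: replace self-attention with FFTs.

    Applies a 2D discrete Fourier transform over the sequence and hidden
    dimensions and keeps only the real part. The mixing itself has no
    learned parameters; in a full block it would be followed by a
    residual connection, layer norm, and a fully connected feed-forward
    sublayer, as in a standard transformer encoder.
    """
    return np.real(np.fft.fft2(x))

tokens = np.random.rand(16, 64)   # (seq_len, hidden_dim)
mixed = fourier_mixing(tokens)
print(mixed.shape)  # (16, 64)
```

Because the FFT runs in O(n log n) rather than the O(n²) of self-attention, this trade keeps performance close to attention-based models at a fraction of the compute, which is the trend-reversal the citation refers to.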