Attention-Aligned Transformer for Image Captioning

Fei, Zhengcong

doi:10.1609/aaai.v36i1.19940

Cited by 25 publications

(8 citation statements)

References 38 publications


“…(2022), ViTCAP, an image captioning model based on a pure visual transformer, is proposed, in which a grid representation is used without extracting regional features. In the study of Fei (2022), an attention-aligned converter for image captions is proposed, called A², which is a perturbation-based, self-supervised way to guide attention learning without any annotation overhead. In the study of Liu et al.…”

Section: Related Work (mentioning)

confidence: 99%

DIC-Transformer: interpretation of plant disease classification results using image caption generation technology

Zeng,

Sun,

Wang

2024

Front. Plant Sci.


Disease image classification systems play a crucial role in identifying disease categories in the field of agricultural diseases. However, current plant disease image classification methods can only predict the disease category and do not offer explanations for the characteristics of the predicted disease images. To address this, this paper employed image description generation technology to produce distinct descriptions for different plant disease categories. A two-stage model called DIC-Transformer, which encompasses three tasks (detection, interpretation, and classification), was proposed. In the first stage, Faster R-CNN was utilized to detect the diseased area and generate the feature vector of the diseased image, with the Swin Transformer as the backbone. In the second stage, the model utilized the Transformer to generate image captions. It then generated the image feature vector, which is weighted by text features, to improve the performance of image classification in the subsequent classification decoder. Additionally, a dataset containing text and visualizations for agricultural diseases (ADCG-18) was compiled. The dataset contains images of 18 diseases and descriptive information about their characteristics. Then, using the ADCG-18, the DIC-Transformer was compared to 11 existing classical caption generation methods and 10 image classification models. The evaluation indicators for captions include BLEU-1 to BLEU-4, CIDEr-D, and ROUGE. The values of BLEU-1, CIDEr-D, and ROUGE were 0.756, 450.51, and 0.721. The results of DIC-Transformer were 0.01, 29.55, and 0.014 higher than those of the highest-performing comparison model, Fc. The classification evaluation metrics include accuracy, recall, and F1 score, with accuracy at 0.854, recall at 0.854, and F1 score at 0.853. The results of DIC-Transformer were 0.024, 0.078, and 0.075 higher than those of the highest-performing comparison model, MobileNetV2. The results indicate that the DIC-Transformer outperforms other comparison models in classification and caption generation.



“…Image Captioning. In recent years, a large number of neural systems have been proposed for the image captioning task [3,9,16,22,24,40,53,58]. The state-of-the-art approaches depend on the encoder-decoder framework to translate the image into a descriptive sentence.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Modeling of Future Context for Image Captioning

Fei

2022

Proceedings of the 30th ACM International Conference on Multimedia


Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context: the given image and the previously generated words. Many studies have aimed to make use of global information during decoding, e.g., iterative refinement. However, how to incorporate future context effectively and efficiently remains under-explored. To respond to this issue, inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with shared visual encoders, forcing the visual encoder to contain sufficient and valid future context; then the AIC model is encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, which follows a teacher-student paradigm and is optimized with the distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses the state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption. CCS Concepts: Computing methodologies → Computer vision; Natural language processing.


“…Image captioning, which aims to generate textual descriptions of input images, is a critical task in multimedia analysis (Stefanini et al 2021). Previous works in this area are mostly based on an encoder-decoder paradigm (Vinyals et al 2015; Xu et al 2015; Rennie et al 2017; Anderson et al 2018; Huang et al 2019; Cornia et al 2020; Pan et al 2020; Fei 2022; Li et al 2022; Yang, Liu, and Wang 2022), where a convolution-neural-network-based image encoder first process an input image into visual representations, and then a recursive-neural-network or Transformer-based language decoder produces a corresponding caption based on these extracted features. The generation process usually relies on a chain-rule factorization and is performed in an autoregressive manner, i.e., words by words from left to right.…”

Section: Introduction (mentioning)

confidence: 99%

Uncertainty-Aware Image Captioning

Fei,

Fan,

Li

et al. 2023

AAAI


It is well believed that the higher uncertainty in a word of the caption, the more inter-correlated context information is required to determine it. However, current image captioning methods usually consider the generation of all words in a sentence sequentially and equally. In this paper, we propose an uncertainty-aware image captioning framework, which parallelly and iteratively operates insertion of discontinuous candidate words between existing words from easy to difficult until converged. We hypothesize that high-uncertainty words in a sentence need more prior information to make a correct decision and should be produced at a later stage. The resulting non-autoregressive hierarchy makes the caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-word model to measure the word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark reveal that our approach outperforms the strong baseline and related methods on both captioning quality as well as decoding speed.

