2020
DOI: 10.1109/tip.2020.2969330

Image Captioning With End-to-End Attribute Detection and Subsequent Attributes Prediction

Cited by 78 publications (35 citation statements)
References 31 publications
“…The authors also proposed using a Convolutional GRU to compute the assignment, investigating the spatio-temporal correlation between successive frames at each time stamp. Huang et al. [44] proposed an image captioning model with end-to-end attribute detection and subsequent attributes prediction. In this model, features are extracted using a ResNet-101 based Faster R-CNN.…”
Section: Feature Extraction
confidence: 99%
“…The model proposed by Wang et al. [126] has two layers, each containing one GRU in the decoding phase. Huang et al. [44] proposed a two-layer SAP-LSTM for decoding.…”
Section: Sentence Generation
confidence: 99%
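The stacked two-layer GRU decoding described in the excerpt above can be illustrated with a minimal NumPy sketch. All dimensions, parameter names, and the toy input sequence are illustrative assumptions, not the cited models' actual configurations; layer 1 consumes the input embedding and layer 2 consumes layer 1's hidden state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """A single GRU cell (standard update/reset-gate formulation)."""
    def __init__(self, d_in, d_h, rng):
        s = 1.0 / np.sqrt(d_h)  # simple uniform init (assumed)
        self.Wz = rng.uniform(-s, s, (d_h, d_in)); self.Uz = rng.uniform(-s, s, (d_h, d_h))
        self.Wr = rng.uniform(-s, s, (d_h, d_in)); self.Ur = rng.uniform(-s, s, (d_h, d_h))
        self.Wn = rng.uniform(-s, s, (d_h, d_in)); self.Un = rng.uniform(-s, s, (d_h, d_h))

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)        # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)        # reset gate
        n = np.tanh(self.Wn @ x + self.Un @ (r * h))  # candidate state
        return (1 - z) * n + z * h                    # new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 6, 8                          # toy sizes (assumed)
layer1 = GRUCell(d_in, d_h, rng)
layer2 = GRUCell(d_h, d_h, rng)
h1, h2 = np.zeros(d_h), np.zeros(d_h)

# Decode a short toy sequence: the second layer's hidden state h2
# would feed the output word distribution in a full decoder.
for x in rng.standard_normal((3, d_in)):
    h1 = layer1.step(x, h1)
    h2 = layer2.step(h1, h2)
```

In a complete captioning decoder, `x` would be the previous word's embedding (possibly concatenated with a visual context vector) and `h2` would be projected to vocabulary logits at each step.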
“…These works can be sorted by attention network structure and by how the attention weights are calculated. (1) Single-layer vs. multi-layer: some studies (for example, [3, 4, 19–21]) employ a single-layer implementation of an attention mechanism, taking the hidden state as the query vector to extract visual features at each step, while the studies [8, 16, 17, 22, 23] chose a multi-layer attention implementation in their decoder. (2) Whether extra clues are involved in the attention weights: for example, studies [11, 24, 25] obtained their attention weights only from previous attention calculations, while the authors of [18] combined geometry clues with previously calculated weights to form the final attention weights.…”
Section: Related Work
confidence: 99%
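A single-layer attention step of the kind described above, with the decoder hidden state as the query over a set of per-region visual features, can be sketched as follows. This is a minimal NumPy sketch using Bahdanau-style additive scoring; the weight matrices and dimensions are illustrative assumptions, not any cited model's parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(h, V, Wh, Wv, w):
    """One additive-attention step: h is the decoder hidden state (query);
    V holds one visual feature vector per image region.
    Returns the attended context vector and the attention weights."""
    # score_i = w^T tanh(Wh h + Wv v_i)
    scores = np.tanh(h @ Wh.T + V @ Wv.T) @ w   # (num_regions,)
    alpha = softmax(scores)                      # attention weights, sum to 1
    context = alpha @ V                          # weighted sum over regions
    return context, alpha

rng = np.random.default_rng(0)
d_h, d_v, d_a, regions = 8, 6, 5, 4  # toy sizes (assumed)
h = rng.standard_normal(d_h)
V = rng.standard_normal((regions, d_v))
Wh = rng.standard_normal((d_a, d_h))
Wv = rng.standard_normal((d_a, d_v))
w = rng.standard_normal(d_a)

context, alpha = attend(h, V, Wh, Wv, w)
```

A multi-layer variant, as in the second group of studies, would stack such attention modules or interleave them between decoder layers; the extra-clue variant would add terms (e.g. geometry features) into the score before the softmax.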
“…The application of image captioning is broad and significant, especially in human–computer interaction. In addition, practical image captioning can help people with disabilities interact with others [6]. It is also used to create multimedia content descriptions, assist e-commerce companies with digital marketing, and generate narratives for online news content [7].…”
Section: Introduction
confidence: 99%