2019
DOI: 10.1109/tip.2018.2881928
Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation

Abstract: Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from object, scene to abstract concept; 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting rich a…
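The abstract is truncated above. To make the "variable number of class labels" setting concrete, here is a minimal, generic sketch of multi-label prediction by score thresholding; the function name, threshold strategy, and values are our illustration, not the label-quantity prediction method the paper itself proposes.

```python
import torch

def annotate(logits, threshold=0.5, max_labels=10):
    """Keep every label whose sigmoid score clears the threshold; list length varies per image."""
    probs = torch.sigmoid(logits)
    keep = (probs >= threshold).nonzero(as_tuple=True)[0]
    # Order the surviving labels by confidence and cap the list length.
    keep = keep[probs[keep].argsort(descending=True)][:max_labels]
    return keep.tolist()

# Example: scores for four labels -> labels 0 and 2 are predicted.
print(annotate(torch.tensor([2.0, -1.0, 0.7, -3.0])))  # [0, 2]
```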

Cited by 75 publications (43 citation statements)
References 43 publications
“…Many researchers have pointed out that the per-word metrics are biased toward infrequent labels, because making them correct can have a very significant impact on the final accuracy [24]. They therefore propose overall metrics (sometimes called per-image metrics) [24], [29]–[31]. The overall metrics are defined as…”
Section: B. Evaluation Metrics
Mentioning confidence: 99%
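The quotation breaks off before the definitions themselves. For orientation only: the overall (per-image) metrics are conventionally computed by pooling counts over the whole test set. The notation below (N_i^c, N_i^p, N_i^g for the numbers of correctly predicted, predicted, and ground-truth labels of image i) is ours and may differ from the citing paper's.

```latex
\text{O-P} = \frac{\sum_i N_i^{c}}{\sum_i N_i^{p}}, \qquad
\text{O-R} = \frac{\sum_i N_i^{c}}{\sum_i N_i^{g}}, \qquad
\text{O-F1} = \frac{2 \cdot \text{O-P} \cdot \text{O-R}}{\text{O-P} + \text{O-R}}
```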
“…In essence, as far as image understanding is concerned, the overall metrics are much better suited than the per-word metrics, since they focus on how well the content of each individual image is understood. Accordingly, alongside the per-word metrics, the overall metrics are considered appropriate for comparing image annotation methods [24], [29]–[31]. In addition, we also introduce a hybrid F1-measure (called H-F1) that combines the per-word F1-measure and the overall F1-measure via the harmonic mean [29].…”
Section: E. Further Evaluation
Mentioning confidence: 99%
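Since the quote states that H-F1 is the harmonic mean of the two F1 scores, it can be written as follows (the symbols F1_w for per-word F1 and F1_o for overall F1 are ours, not the citing paper's):

```latex
\text{H-F1} = \frac{2 \cdot F1_{w} \cdot F1_{o}}{F1_{w} + F1_{o}}
```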
“…), similar to a previous approach [16]. Second, we give the model more capacity by adding an FC layer after concatenation, which allows it to learn a nonlinear combination of the features (Dense-GRU FC). Third, we do not combine the features explicitly but instead let them learn interactions with each other through a cross-attention module (Dense-GRU CA).…”
Section: Models
Mentioning confidence: 99%
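To make the two fusion variants in this quote concrete, here is a minimal PyTorch sketch. The module names, dimensions, and the use of nn.MultiheadAttention are our assumptions; the cited Dense-GRU FC and Dense-GRU CA modules may differ in detail.

```python
import torch
import torch.nn as nn

class FCFusion(nn.Module):
    """'FC' variant: concatenate two feature vectors, then learn a nonlinear combination."""
    def __init__(self, dim_a, dim_b, dim_out):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim_a + dim_b, dim_out), nn.ReLU())

    def forward(self, feat_a, feat_b):
        # feat_a: (batch, dim_a), feat_b: (batch, dim_b)
        return self.fc(torch.cat([feat_a, feat_b], dim=-1))

class CrossAttentionFusion(nn.Module):
    """'CA' variant: no explicit combination; one feature sequence attends to the other."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, seq_a, seq_b):
        # seq_a: (batch, len_a, dim) as queries; seq_b: (batch, len_b, dim) as keys/values.
        fused, _ = self.attn(query=seq_a, key=seq_b, value=seq_b)
        return fused

# Example shapes: fuse a 512-d and a 256-d vector into 512 dims,
# or let an 8-step sequence attend over a 12-step one.
v = FCFusion(512, 256, 512)(torch.randn(4, 512), torch.randn(4, 256))
s = CrossAttentionFusion(512)(torch.randn(4, 8, 512), torch.randn(4, 12, 512))
```

The design trade-off the quote describes is visible here: the FC variant forces a single fixed mixing of the two inputs, while cross-attention lets each query position choose what to take from the other modality.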
“…These solutions treat the image annotation problem as an image-to-text translation problem and solve it using an encoder-decoder model. The multiscale approach of [18] proposes a novel multiscale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Instead of CNN features, some works use more semantic information obtained from the image as the input to the decoder [19, 20].…”
Section: Image Tagging
Mentioning confidence: 99%
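As a rough illustration of the multiscale idea attributed to [18], features can be pooled from several depths of a CNN backbone and concatenated into one rich descriptor. The backbone choice, tapped stages, and pooling below are our assumptions, not the exact architecture of [18].

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleFeatures(nn.Module):
    """Pool the feature maps of all four ResNet stages and concatenate them."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)                           # coarser scale at each stage
            feats.append(self.pool(x).flatten(1))  # (batch, channels) per scale
        return torch.cat(feats, dim=1)             # (batch, 256 + 512 + 1024 + 2048)

# Example: one 224x224 RGB image -> a 3840-dim multi-scale descriptor.
desc = MultiScaleFeatures()(torch.randn(1, 3, 224, 224))
```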