Dual Compositional Learning in Interactive Image Retrieval

Kim, Jong-Seok; Yu, Yang; Kim, Hoeseong; Kim, Gun-Hee

doi:10.1609/aaai.v35i2.16271

Cited by 55 publications

(27 citation statements)

References 31 publications


“…Table 3 shows the quantitative results on the Fashion-IQ validation set. Our approach outperforms the state of the art, improving by up to ∼5% in average R@10 and 3% in average R@50 over the best method, DCNet [17], when using the CLIP RN50x4 backbone. Our method has the highest recall in the Shirt and Toptee categories, with comparable performance in the Dress category, using both backbones.…”

Section: Comparison With SOTA (mentioning)

confidence: 88%

“…A schema of the complete system is shown in Figure 1 on page 1. In contrast to previous works like [6,17,19,24] that build on separate image and text models, we start from the hypothesis of having a common embedding of images and text, realized by CLIP. As shown in [22], similar concepts expressed in text and images tend to share similar features, or at least be "near" in the common space.…”

Section: The Proposed Method (mentioning)

confidence: 99%

“…In [19], image style and content are considered separately by two different neural network modules. In [17] a Correction Network is added which explicitly models the difference between the reference and target image in the embedding space.…”

Section: Previous Work (mentioning)

confidence: 99%

“…Instead, in [19,27] features extracted from the backbone are 3-dimensional and the composition takes care of spatial information, in [6] the features are extracted at different convolutional layers from the ResNet-50 backbone. In [17] the authors divided the image and the sentence into a set of localized components assigning a representation module, denoted as experts, to each of them. More similar to our work is [24] which trains a combiner directly on flattened image and text features that, differently from our work, are obtained from different embeddings.…”

Section: Previous Work (mentioning)

confidence: 99%

“…[Table header residue: Dress, Toptee, Average; Method; four R@10/R@50 column pairs; first row begins: JVSM [5] 12] We follow the experimental setting of [17,19]. We employ the average recall at rank K (Recall@K) as the evaluation metric, namely Recall@10 (R@10) and Recall@50 (R@50).…”

Section: Shirt (mentioning)

confidence: 99%
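
The Recall@K metric named in the citation statement above can be illustrated with a minimal sketch. It assumes the usual retrieval setting in which each query has exactly one target image and the system returns a ranked list of candidate IDs. The function names, the toy rankings, and the per-category averaging helper are hypothetical and are not taken from any of the cited papers.

```python
from typing import Mapping, Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]],
                target_ids: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose target appears among the top-k results.

    Assumes one ground-truth target per query (an assumption of this sketch).
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need exactly one target per ranked list")
    if not target_ids:
        return 0.0
    hits = sum(1 for ranking, target in zip(ranked_ids, target_ids)
               if target in ranking[:k])
    return hits / len(target_ids)


def average_recall(per_category: Mapping[str, float]) -> float:
    """Unweighted mean of per-category recall values."""
    return sum(per_category.values()) / len(per_category)


# Toy data: three queries, each with a ranked list of three candidate IDs.
rankings = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
targets = [1, 1, 3]

r_at_1 = recall_at_k(rankings, targets, 1)  # no target is ranked first
r_at_2 = recall_at_k(rankings, targets, 2)  # only the first query hits
r_at_3 = recall_at_k(rankings, targets, 3)  # every target is in the top 3
```

On a benchmark with several categories, a per-category value would be computed this way and then combined, for example with `average_recall({"dress": ..., "shirt": ..., "toptee": ...})`. Whether the cited works use an unweighted mean is an assumption here.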
