2021
DOI: 10.1609/aaai.v35i2.16271

Dual Compositional Learning in Interactive Image Retrieval

Abstract: We present an approach named Dual Composition Network (DCNet) for interactive image retrieval, which searches for the target image that best matches a natural language query and a reference image. To accomplish this task, existing methods have focused on learning a composite representation of the reference image and the text query that is as close as possible to the embedding of the target image. We refer to this approach as the Composition Network. In this work, we propose to close the loop with a Correction Network that models th…
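The abstract is truncated, but the composition idea it describes (fusing a reference-image embedding with a text embedding so the result lands near the target image's embedding) can be illustrated with a minimal, hypothetical PyTorch sketch. The module name, fusion architecture, and batch-wise contrastive loss below are illustrative assumptions, not DCNet's actual implementation; the Correction Network mentioned in the abstract is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionNetwork(nn.Module):
    """Hypothetical composition module: fuses a reference-image embedding with a
    text embedding into one vector that should lie near the target-image embedding."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, ref_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([ref_emb, txt_emb], dim=-1))

def retrieval_loss(composed: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Batch-wise contrastive loss: each composed query should be most similar to its
    own target embedding (a common choice; not necessarily DCNet's exact loss)."""
    composed = F.normalize(composed, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = composed @ targets.t()          # pairwise cosine similarities
    labels = torch.arange(logits.size(0))    # the diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for real image/text features.
ref, txt, tgt = (torch.randn(8, 512) for _ in range(3))
net = CompositionNetwork()
loss = retrieval_loss(net(ref, txt), tgt)
```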

Cited by 55 publications (27 citation statements)
References 31 publications

Citation statements:
“…Table 3 shows the quantitative results on the Fashion-IQ validation set. Our approach outperforms the state of the art, improving average R@10 by up to ∼5% and average R@50 by 3% over the best method, DCNet [17], when using the CLIP RN50x4 backbone. Our method has the highest recall in the Shirt and Toptee categories, with comparable performance in the Dress category, using both backbones.…”
Section: Comparison With SOTA
Citation type: mentioning
Confidence: 88%
“…A schema of the complete system is shown in Figure 1 on page 1. In contrast to previous works like [6,17,19,24] that build from separate image and text models, we start from the hypothesis of having a common embedding of images and text, realized by CLIP. As shown in [22], similar concepts expressed in text and images tend to share similar features, or at least be "near" in the common space.…”
Section: The Proposed Methods
Citation type: mentioning
Confidence: 99%
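The shared-embedding hypothesis in the statement above can be checked directly: encode an image and a caption with the same CLIP model and compare them with cosine similarity. This sketch uses the public OpenAI CLIP package; the RN50x4 backbone is the one named in the first citation statement, while the file name and caption are placeholders, and none of this is the cited paper's actual pipeline.

```python
import torch
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x4", device=device)

image = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(["a blue shirt with short sleeves"]).to(device)     # placeholder caption

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)

# Both embeddings live in the same space, so cosine similarity is meaningful.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
similarity = (img_emb @ txt_emb.t()).item()
print(f"image-text cosine similarity: {similarity:.3f}")
```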