Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475659
Heterogeneous Feature Fusion and Cross-modal Alignment for Composed Image Retrieval

Cited by 22 publications (4 citation statements) | References 26 publications
“…We compared TG-CIR with the following baselines: TIRG [32], VAL [5], CIRPLANT [24], CosMo [21], DATIR [11], MCR [38], CLVC-Net [35], ARTEMIS [7], EER [37], FashionViL [16], CRR [36], AMC [41], Clip4cir [1], and FAME-ViL [16]. Notably, the former twelve baselines use traditional models like ResNet [18] and LSTM [20] as the feature extraction backbone, while the latter two take advantage of the multimodal pre-trained large model CLIP.…”
Section: Performance Comparison
confidence: 99%
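
As background for the backbone distinction this statement draws, here is a minimal sketch contrasting a ResNet+LSTM feature extractor with CLIP's paired encoders. The dimensions, dummy inputs, and use of the open-source `clip` package are illustrative assumptions, not details of any listed baseline.

```python
import torch
import torch.nn as nn
import clip                                   # OpenAI CLIP package
from torchvision.models import resnet50

# "Traditional" route: separate unimodal encoders (a ResNet for the image,
# an LSTM over word embeddings for the text), as in the first twelve baselines.
cnn = resnet50(weights=None)
cnn.fc = nn.Identity()                        # expose the 2048-d pooled feature
lstm = nn.LSTM(input_size=300, hidden_size=512, batch_first=True)

image = torch.randn(1, 3, 224, 224)           # dummy image batch
words = torch.randn(1, 12, 300)               # dummy word-embedding sequence
img_feat = cnn(image)                         # (1, 2048)
_, (txt_feat, _) = lstm(words)                # final hidden state, (1, 1, 512)

# CLIP route: one pre-trained model embeds both modalities into a joint space.
model, preprocess = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    clip_img = model.encode_image(image)                      # (1, 512)
    clip_txt = model.encode_text(clip.tokenize(["a dress"]))  # (1, 512)
```

The practical difference is that the CLIP features already live in a shared image-text space, whereas the ResNet and LSTM features must be aligned by the retrieval model itself.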
“…For multimodal query composition, existing methods are devoted to designing various neural networks [1,3,5,7,12,13,21,32,35,38] to compose the multimodal query, but overlook modeling the intrinsic conflicting relationship within the multimodal query. Figure 1 illustrates an example of a multimodal query, where the reference image indicates that the user may want a white princess dress, while the modification text specifies that the user wants to change the color and style of the reference image to "black" and "more elegant", respectively.…”
Section: Introduction
confidence: 99%
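
To make "composing the multimodal query" concrete, here is a minimal, hypothetical concatenate-and-project compositor. It is not the method of this paper or of any cited baseline, and, as the statement notes of existing designs, it has no explicit handling of conflicting cues between the two modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatCompositor(nn.Module):
    """Concatenate the reference-image and modification-text embeddings,
    then project into the retrieval space. Purely illustrative: nothing here
    resolves conflicts (e.g. 'white' in the image vs. 'black' in the text)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        query = self.proj(torch.cat([img_emb, txt_emb], dim=-1))
        return F.normalize(query, dim=-1)     # unit-norm for cosine retrieval

query = ConcatCompositor()(torch.randn(4, 512), torch.randn(4, 512))
```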
“…intermodal attention (Wehrmann et al., 2020) and self-attention mechanisms (Han et al., 2021). Zhang et al. (2021b) design a cross-modal guided pooling module that attends to local information dynamically. These sophisticated aggregators typically require more time and don't always outperform simple pooling strategies.…”
Section: Related Work
confidence: 99%
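
Below is a sketch of what a text-guided pooling aggregator can look like, assuming grid-shaped local features; this is a generic construction, not the exact module of Zhang et al. (2021b).

```python
import torch
import torch.nn as nn

class TextGuidedPooling(nn.Module):
    """Pool local image features with attention weights driven by the text
    embedding: a generic stand-in for 'cross-modal guided pooling'."""
    def __init__(self, dim=512):
        super().__init__()
        self.scale = dim ** -0.5              # dot-product temperature

    def forward(self, local_feats, txt_emb):
        # local_feats: (B, N, D) grid features; txt_emb: (B, D)
        scores = (local_feats @ txt_emb.unsqueeze(-1)).squeeze(-1) * self.scale
        weights = scores.softmax(dim=-1)      # (B, N) attention over regions
        return (weights.unsqueeze(-1) * local_feats).sum(dim=1)  # (B, D)

pooled = TextGuidedPooling()(torch.randn(2, 49, 512), torch.randn(2, 512))
```

Unlike mean or max pooling, the weighting here changes with the query text, which is the dynamic behavior the statement refers to; it also adds a matrix product per query, which is where the extra time cost comes from.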
“…A compositor plays a fundamental role in integrating the textual information with the imagery modality. TGR compositors have been proposed based on various techniques, such as gating mechanisms [49], hierarchical attention [7,23,12,20], graph neural networks [54,44], joint learning [6,27,44,52,55], ensemble learning [50], style-content modification [29,5] and vision & language pre-training [32].…”
Section: Related Work
confidence: 99%
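
Of the listed techniques, the gating mechanism is the simplest to sketch. The following is a hedged, TIRG-style gated-residual compositor; all layer shapes are assumptions for illustration, not the design of the cited work.

```python
import torch
import torch.nn as nn

class GatedResidualCompositor(nn.Module):
    """TIRG-style compositor: a sigmoid gate preserves reference-image
    content while a residual branch injects the text-specified change.
    Layer sizes are illustrative."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.res = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        joint = torch.cat([img_emb, txt_emb], dim=-1)
        return self.gate(joint) * img_emb + self.res(joint)

out = GatedResidualCompositor()(torch.randn(4, 512), torch.randn(4, 512))
```

The gate-plus-residual split is what distinguishes this family from plain concatenation: the gate decides how much of the reference image to keep, while the residual term carries the modification.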