2022
DOI: 10.1609/aaai.v36i6.20614
Learning Aligned Cross-Modal Representation for Generalized Zero-Shot Classification

Abstract: Learning a common latent embedding by aligning the latent spaces of cross-modal autoencoders is an effective strategy for Generalized Zero-Shot Classification (GZSC). However, due to the lack of fine-grained instance-wise annotations, it still easily suffers from the domain shift problem caused by the discrepancy between the visual representations of diversified images and the semantic representations of fixed attributes. In this paper, we propose an innovative autoencoder network by learning Aligned Cross-Modal Repres…
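The aligned latent embedding the abstract describes can be illustrated with a minimal sketch: two modality-specific autoencoders whose latent codes are pulled together by an alignment penalty on top of the per-modality reconstruction terms. All names, dimensions, and the linear encoders below are hypothetical simplifications, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): visual features, attribute features, shared latent.
d_vis, d_att, d_lat = 8, 4, 3

# Linear "encoders"/"decoders" per modality (random, untrained, for illustration).
W_enc_v = rng.normal(size=(d_lat, d_vis))
W_dec_v = rng.normal(size=(d_vis, d_lat))
W_enc_a = rng.normal(size=(d_lat, d_att))
W_dec_a = rng.normal(size=(d_att, d_lat))

def aligned_autoencoder_loss(x_vis, x_att, lam=1.0):
    """Per-modality reconstruction losses plus a latent-alignment penalty
    that pulls the visual and attribute latent codes together."""
    z_v = W_enc_v @ x_vis          # visual latent code
    z_a = W_enc_a @ x_att          # attribute (semantic) latent code
    recon_v = np.mean((W_dec_v @ z_v - x_vis) ** 2)
    recon_a = np.mean((W_dec_a @ z_a - x_att) ** 2)
    align = np.mean((z_v - z_a) ** 2)   # alignment of the two latent spaces
    return recon_v + recon_a + lam * align

x_vis = rng.normal(size=d_vis)
x_att = rng.normal(size=d_att)
loss = aligned_autoencoder_loss(x_vis, x_att)
```

Training would minimize this combined objective over paired (image, attribute) data; the alignment term is what lets attribute codes of unseen classes be compared against visual codes at test time.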

Cited by 12 publications (2 citation statements)
References 37 publications
“…In response to the intricacy of multimodal environments, Wu et al [25] proposed a two-stage learning approach that effectively interprets textual scenes and surrounding visual information. Fang et al [26] introduced an innovative autoencoder network that explores semantic disparities between visual representations and text through reconstruction constraints across modalities. Additionally, Yin et al's [27] Convolutional Auto-Encoder (CAE) model establishes meaningful correlations among high-level semantic relationships to enhance accuracy in image-text retrieval within multimodal environments.…”
Section: Auto-encoder
confidence: 99%
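The cross-modal reconstruction constraint this statement attributes to Fang et al. can be sketched as follows: each modality is reconstructed from the other modality's latent code, which forces the two latent spaces to carry interchangeable information. This is a minimal illustrative sketch with hypothetical linear encoders and decoders, not the cited network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (hypothetical): visual, attribute/text, shared latent.
d_vis, d_att, d_lat = 8, 4, 3

W_enc_v = rng.normal(size=(d_lat, d_vis))
W_enc_a = rng.normal(size=(d_lat, d_att))
W_dec_v = rng.normal(size=(d_vis, d_lat))
W_dec_a = rng.normal(size=(d_att, d_lat))

def cross_reconstruction_loss(x_vis, x_att):
    """Cross-modal reconstruction: decode each modality from the OTHER
    modality's latent code, penalizing the semantic disparity between them."""
    z_v = W_enc_v @ x_vis
    z_a = W_enc_a @ x_att
    cross_v = np.mean((W_dec_v @ z_a - x_vis) ** 2)  # image from text code
    cross_a = np.mean((W_dec_a @ z_v - x_att) ** 2)  # text from image code
    return cross_v + cross_a

x_vis = rng.normal(size=d_vis)
x_att = rng.normal(size=d_att)
xloss = cross_reconstruction_loss(x_vis, x_att)
```

In practice this term is added to the within-modality reconstruction objective, so that driving it down explicitly reduces the visual-semantic gap.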
“…A widely utilized method in vision-language pre-trained models is cross-modal representation learning (Fang et al 2022; Li et al 2020; Wehrmann, Kolling, and Barros 2020; Zheng et al 2020). Inspired by the advantages of cross-modal alignment and Bayesian methods (Lin et al 2022) for alleviating overfitting, instead of focusing on improving learning algorithms for learning OoD-generalizable image features, we propose the Bayesian cross-modal alignment learning method (Bayes-CAL) to achieve few-shot OoD generalization.…”
Section: Introduction
confidence: 99%