Learning Aligned Cross-Modal Representation for Generalized Zero-Shot Classification

Fang, Zhiyu; Zhu, Xiaobin; Yang, Chun; Zheng, Han; Qin, Jingyan; Yin, Xu-Cheng

doi:10.1609/aaai.v36i6.20614

Cited by 12 publications

(2 citation statements)

References 37 publications

“…In response to the intricacy of multimodal environments, Wu et al [25] proposed a two-stage learning approach that effectively interprets textual scenes and surrounding visual information. Fang et al [26] introduced an innovative autoencoder network that explores semantic disparities between visual representations and text through reconstruction constraints across modalities. Additionally, Yin et al's [27] Convolutional Auto-Encoder (CAE) model establishes meaningful correlations among high-level semantic relationships to enhance accuracy in image-text retrieval within multimodal environments.…”

Section: Auto-encoder (mentioning)

confidence: 99%

A review of cross-modal retrieval for image-text

Xia, Yang, et al. 2024

Fifteenth International Conference on Graphics and Image Processing (ICGIP 2023)

With the rapid advancement of Internet technology and the widespread adoption of smart devices, there has been a substantial increase in multimodal data that conveys identical semantics in diverse coding formats. To foster the advancement of social intelligence, scholars are increasingly investigating the semantic correlations among multimodal data, which is a current research focus. The primary objective of cross-modal retrieval is to accurately measure similarity across dissimilar modalities and efficiently retrieve relevant data from other modalities. This article provides a comprehensive overview of the advancements in cross-modal retrieval research. First, it presents a conceptual framework and problem formulation for cross-modal retrieval, elucidating the multimodal nature of image-text cross-modal retrieval. Second, it examines semantic representation learning-based approaches for computing image-text cross-modal similarity and hash-based methods for facilitating cross-modal retrieval. Finally, it compares widely adopted evaluation metrics for current cross-modal retrieval techniques and gives an outlook on future research directions.

“…A widely utilized method in vision-language pre-trained models is the cross-modal representation learning (Fang et al 2022; Li et al 2020; Wehrmann, Kolling, and Barros 2020; Zheng et al 2020). Inspired by the advantages of the cross-modal alignment and Bayesian methods (Lin et al 2022) for alleviating overfitting, instead of focusing on improving learning algorithms for learning OoD generalizable image features, we propose the Bayesian cross-modal alignment learning method (Bayes-CAL) to achieve few-shot OoD generalization.…”

Section: Introduction (mentioning)

confidence: 99%

Bayesian Cross-Modal Alignment Learning for Few-Shot Out-of-Distribution Generalization

Zhu, Wang, Zhou, et al. 2023

AAAI

Recent advances in large pre-trained models have shown promising results in few-shot learning. However, their generalization ability on two-dimensional Out-of-Distribution (OoD) data, i.e., correlation shift and diversity shift, has not been thoroughly investigated. Research has shown that even with a significant amount of training data, few methods can achieve better performance than the standard empirical risk minimization (ERM) method in OoD generalization. This few-shot OoD generalization dilemma emerges as a challenging direction in deep neural network generalization research, where performance suffers from both overfitting on few-shot examples and OoD generalization errors. In this paper, leveraging a broader supervision source, we explore a novel Bayesian cross-modal image-text alignment learning method (Bayes-CAL) to address this issue. Specifically, the model is designed so that only text representations are fine-tuned, via a Bayesian modelling approach with a gradient orthogonalization loss and an invariant risk minimization (IRM) loss. The Bayesian approach is introduced to avoid overfitting the base classes observed during training and to improve generalization to broader unseen classes. The dedicated loss is introduced to achieve better image-text alignment by disentangling the causal and non-causal parts of image features. Numerical experiments demonstrate that Bayes-CAL achieves state-of-the-art OoD generalization performance on two-dimensional distribution shifts. Moreover, compared with CLIP-like models, Bayes-CAL yields more stable generalization performance on unseen classes. Our code is available at https://github.com/LinLLLL/BayesCAL.
