Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence 2022
DOI: 10.24963/ijcai.2022/759
Image-text Retrieval: A Survey on Recent Research and Development

Abstract: Research in cognitive science has provided extensive evidence of human cognitive ability in performing physical reasoning of objects from noisy perceptual inputs. Such a cognitive ability is commonly known as intuitive physics. With advancements in deep learning, there is an increasing interest in building intelligent systems that are capable of performing physical reasoning from a given scene for the purpose of building better AI systems. As a result, many contemporary approaches in modelling intuitive physic…

Cited by 36 publications (10 citation statements). References 2 publications.
“…These models often have deep network backbones like BERT (Devlin et al, 2019) and are trained with loss functions such as masked language modeling, masked image region prediction, and image-text matching. Recent work on image-text matching and retrieval (Cao et al, 2022) has explicitly focused on refining cross-attention mechanisms and local alignment to obtain better retrieval performance. Consequently, these models focus more on aligning literal (objective) relations than abstract, high-level semantic relations (semantic correlation STATUS, modification).…”
Section: Discussion
confidence: 99%
“…It also delves into experimental benchmarks, metrics, and performances, proffering novel ideas and recommendations for forthcoming research directions. Despite previous literature [7]–[10] addressing cross-modal retrieval, it is riddled with several deficiencies in timeliness, taxonomy, comprehensiveness, and more. Concretely, literature [7], [8] furnishes insights into the early stages of cross-modal retrieval, yet its portrayal of representative methods and contemporary advancements is hindered by temporal gaps.…”
Section: Introduction
confidence: 99%
“…Notably, the recent emergence of Transformer architectures and vision-language pre-training models has wielded a profound impact on the domain of deep learning, fundamentally reshaping the cross-modal retrieval research landscape. While recent years have seen the publication of literature [9], [10], their scope and taxonomy fall markedly short. Within literature [9], the discourse on cross-modal retrieval methods rooted in self-attention mechanisms or large-scale pre-trained models is notably sparse.…”
Section: Introduction
confidence: 99%