2021
DOI: 10.48550/arxiv.2112.06482
Preprint
ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

Abstract: Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most of the work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality and are not aligned in the same space. As t…

Cited by 4 publications (8 citation statements)
References 27 publications
“…Therefore, the performance of named entity recognition models on practical data is not ideal. In recent years, many studies on multimodal named entity recognition have incorporated images corresponding to the text as supplementary information for text fusion [12][13][14][15] to improve information accuracy. However, these studies did not pay attention to the large amount of noise introduced by irrelevant image information.…”
Section: Related Work
confidence: 99%
“…They employ diverse cross-modal attention mechanisms to facilitate the interaction between text and images. Recently, Wang et al. (2021a) pointed out that the performance limitations of such methods are largely attributable to the disparities in distribution between different modalities. Although Wang et al. (2022c) try to mitigate these issues by further refining cross-modal attention, training such end-to-end cross-modal Transformer architectures imposes significant demands on computational resources.…”
Section: Multimodal Named Entity Recognition
confidence: 99%
“…Due to the aforementioned limitations, ITA (Wang et al., 2021a) and MoRe (Wang et al., 2022a) attempt to use a new paradigm to address MNER. ITA circumvents the challenge of multi-modal alignment by forgoing raw visual features, instead using OCR and image captioning techniques to convey image information.…”
Section: Multimodal Named Entity Recognition
confidence: 99%