ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

Wang, Xinyu; Gui, Min; Jiang, Yong; Jia, Zixia; Bach, Nguyen; Wang, Tao; Huang, Zhigao; Huang, Fei; Tu, Kewei

doi:10.48550/arxiv.2112.06482

Cited by 4 publications

(8 citation statements)

References 27 publications


“…Therefore, the performance of named entity recognition models in practical data is not ideal. In recent years, many studies on multimodal named entity recognition have incorporated text corresponding images as supplementary information into text fusion [12][13][14][15] to improve the problem of information accuracy. However, these studies did not pay attention to the large amount of noise generated by irrelevant image information.…”

Section: Related Work (mentioning)

confidence: 99%

A Multimodal Named Entity Recognition Model for Power Equipment Based on Deep Neural Network

Zhang, Song, Zhao et al. 2023

Advances in Transdisciplinary Engineering


Digital empowerment of China’s power energy sector is a key factor in increasing its economic and social benefits, and named entity recognition technology is the most fundamental and core task of information extraction technology in the digital empowerment process. Therefore, we propose a multimodal named entity recognition model PE-MNER for power equipment based on deep neural networks. Compared to text multimodality, text and image multimodality can use image information to supplement missing information in the text, thus enabling more accurate entity extraction. The model first obtains a BERT neural network through incremental training, and then extracts Chinese character features through the network. Then, a hierarchical visual prefix fusion network is used to fuse image information. From the comparative experimental results, it can be seen that the proposed model has the best performance compared to the benchmark model, with an improvement of 4.08%∼7.20% in the F1 score compared to the benchmark model.


“…They employ diverse cross-modal attention mechanisms to facilitate the interaction between text and images. Recently, Wang et al (2021a) points out that the performance limitations of such methods are largely attributed to the disparities in distribution between different modalities. Despite Wang et al (2022c) try to mitigate the aforementioned issues by using further refining cross-modal attention, training this end-to-end cross-modal Transformer architectures imposes significant demands on computational resources.…”

Section: Multimodal Named Entity Recognition (mentioning)

confidence: 99%

“…Despite Wang et al (2022c) try to mitigate the aforementioned issues by using further refining cross-modal attention, training this end-to-end cross-modal Transformer architectures imposes significant demands on computational resources. Due to the aforementioned limitations, ITA (Wang et al, 2021a) and MoRe (Wang et al, 2022a) attempt to use a new paradigm to address MNER. ITA circumvents the challenge of multi-modal alignment by forsaking the utilization of raw visual features and opting for OCR and image captioning techniques to convey image information.…”

Section: Multimodal Named Entity Recognition (mentioning)

confidence: 99%

“…The version of ChatGPT used in experiments is gpt-3.5-turbo and sampling temperature is set to 0. For a fair comparison, PGIM chooses to use the same text encoder XLM-RoBERTa large (Conneau et al, 2019) as ITA (Wang et al, 2021a), PromptMNER (Wang et al, 2022b), CAT-MNER (Wang et al, 2022c) and MoRe (Wang et al, 2022a).…”

Section: Stage-2 Entity Prediction Based On Auxiliary Refined Knowledge (mentioning)

confidence: 99%

“…In addition, recent studies (Wang et al, 2021b; has shown that introducing additional document-level context on the basis of the original text can significantly improve the performance of NER models. Therefore, recent studies (Wang et al, 2021a, 2022a) aim to solve the MNER task using the Text-Text (T+T) paradigm. In these approaches, images are reasonably converted into textual representations through techniques such as image caption and optical character recognition (OCR).…”

Section: Introduction (mentioning)

confidence: 99%


Prompting ChatGPT in MNER: Enhanced Multimodal Named Entity Recognition with Auxiliary Refined Knowledge

Li, Pan et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023


Multimodal Named Entity Recognition (MNER) on social media aims to enhance textual entity prediction by incorporating image-based clues. Existing studies mainly focus on maximizing the utilization of pertinent image information or incorporating external knowledge from explicit knowledge bases. However, these methods either neglect the necessity of providing the model with external knowledge, or encounter issues of high redundancy in the retrieved knowledge. In this paper, we present PGIM, a two-stage framework that aims to leverage ChatGPT as an implicit knowledge base and enable it to heuristically generate auxiliary knowledge for more efficient entity prediction. Specifically, PGIM contains a Multimodal Similar Example Awareness module that selects suitable examples from a small number of predefined artificial samples. These examples are then integrated into a formatted prompt template tailored to the MNER and guide ChatGPT to generate auxiliary refined knowledge. Finally, the acquired knowledge is integrated with the original text and fed into a downstream model for further processing. Extensive experiments show that PGIM outperforms state-of-the-art methods on two classic MNER datasets and exhibits a stronger robustness and generalization capability.


DGHC: A Hybrid Algorithm for Multi-Modal Named Entity Recognition Using Dynamic Gating and Correlation Coefficients With Visual Enhancements

Liu, Yang, et al. 2024

IEEE Access


Multimodal named entity recognition plays a crucial role in the construction process of knowledge graphs as it directly influences the quality of entity extraction and classification, which in turn affects the overall quality of knowledge graph construction. However, most existing multimodal named entity recognition algorithms do not consider the correlation between text and images. They either use visual features of images as the attention of the text modality or fuse them with textual features. In the case of multimodal tweets containing both text and images, three categories of data can be identified based on the correlation between the two: text that is related to images, text that is partially related to images, and text that is not related to images. Using irrelevant or partially relevant image features as text cross-modal attention can result in incorrect text representation, ultimately leading to misclassification of entities and negatively impacting the model's performance. To address the problem of uncertainty or negative impact caused by the lack of relevance or partial correlation between text and images, this paper proposes a visually enhanced text representation algorithm based on a hybrid of dynamic gating and correlation coefficient. We conducted experiments on two benchmark datasets, namely Twitter-2015 and Twitter-2017. The experimental results were analyzed comprehensively to showcase the strengths of the proposed model.
