Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings

Wang, Yue; Li, Jing; Lyu, Michael R.; King, Irwin

doi:10.18653/v1/2020.emnlp-main.268

Cited by 6 publications

(21 citation statements)

References 39 publications

“…the best performing EfficientNet model obtains 24.72 F1). This suggests that text encapsulates more relevant information for this task than images on their own, similar to other studies in multimodal computational social science (Wang et al, 2020; Ma et al, 2021).…”

Section: Results (supporting)

confidence: 85%

Point-of-Interest Type Prediction using Text and Images

Villegas, Aletras

2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Point-of-interest (POI) type prediction is the task of inferring the type of a place from where a social media post was shared. Inferring a POI's type is useful for studies in computational social science including sociolinguistics, geosemiotics, and cultural geography, and has applications in geosocial networking technologies such as recommendation and visualization systems. Prior efforts in POI type prediction focus solely on text, without taking visual information into account. However, in reality, the variety of modalities, as well as their semiotic relationships with one another, shape communication and interactions in social media. This paper presents a study on POI type prediction using multimodal information from text and images available at posting time. For that purpose, we enrich a currently available data set for POI type prediction with the images that accompany the text messages. Our proposed method extracts relevant information from each modality to effectively capture interactions between text and image, achieving a macro F1 of 47.21 across eight categories and significantly outperforming the state-of-the-art method for POI type prediction based on text-only methods. Finally, we provide a detailed analysis to shed light on cross-modal interactions and the limitations of our best performing model.

“…Compared with the previous models [43,53,54], our model possesses the following two advantages. First, we introduce visual entities that are semantically related to the input image and can be served as anchor points for cross-modal semantic alignment.…”

Section: Introduction (mentioning)

confidence: 99%

“…Typically, these studies adopt a coattention network to fuse textual and visual tweet information for recommending hashtags [53,54]. Unlike the studies mentioned above, Wang et al [43] first perform Optical Character Recognition (OCR) to extract explicit optical characters from the input image and then utilize an image captioning model to extract implicit image attributes that reflect the semantic information of the image. To better integrate multi-modal information, they then introduce a multimodal multi-head attention to model the semantic interactions between different modalities.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this work, we propose a novel multi-modal keyphrase generation model with visual entity enhancement and image noise filtering. Our model is a significant extension of [43]. Our model not only introduces external visual entities as supplementary information of the textual input, but also leverages multi-granularity noise filtering to effectively exploit the input image.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this aspect, the common practice [53,54] use a co-attention network to fuse textual and visual information and then recommend tags for multi-modal tweets. Wang et al [43] propose a multi-modal keyphrase generation model based on an encoder-decoder framework. Typically, the encoder is equipped with a multi-head attention mechanism to fuse multi-modal information, and the decoder is a pointer network.…”

Section: Introduction (mentioning)

confidence: 99%

Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering

Dong, Wu, Meng et al. 2023

Proceedings of the 31st ACM International Conference on Multimedia

Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair. In this regard, dominant methods mainly focus on multi-modal fusion for keyphrase generation. Nevertheless, there are still two main drawbacks: 1) only a limited number of sources, such as image captions, can be utilized to provide auxiliary information. However, they may not be sufficient for the subsequent keyphrase generation. 2) the input text and image are often not perfectly matched, and thus the image may introduce noise into the model. To address these limitations, in this paper, we propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise. First, we introduce external visual entities of the image as the supplementary input to the model, which benefits the cross-modal semantic alignment for keyphrase generation. Second, we simultaneously calculate an image-text matching score and image region-text correlation scores to perform multi-granularity image noise filtering. Particularly, we introduce the correlation scores between image regions and ground-truth keyphrases to refine the calculation of the previously-mentioned correlation scores. To demonstrate the effectiveness of our model, we conduct several groups of experiments on the benchmark dataset. Experimental results and in-depth analyses show that our model achieves the state-of-the-art performance. Our code is available on https://github.com/DeepLearnXMU/MM-MKP.
