Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.268

Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings

Abstract: Social media produces large amounts of content every day. To help users quickly capture what they need, keyphrase prediction is receiving growing attention. Nevertheless, most prior efforts focus on text modeling, largely ignoring the rich features embedded in the matching images. In this work, we explore the joint effects of texts and images in predicting the keyphrases for a multimedia post. To better align social media style texts and images, we propose: (1) a novel Multi-Modality Multi-Head Attention (M…

Cited by 6 publications (21 citation statements)
References 39 publications

“…the best performing EfficientNet model obtains 24.72 F1). This suggests that text encapsulates more relevant information for this task than images on their own, similar to other studies in multimodal computational social science (Wang et al, 2020;Ma et al, 2021).…”
Section: Results (supporting)
confidence: 85%
“…Compared with the previous models [43,53,54], our model possesses the following two advantages. First, we introduce visual entities that are semantically related to the input image and can be served as anchor points for cross-modal semantic alignment.…”
Section: Introduction (mentioning)
confidence: 99%
“…Typically, these studies adopt a coattention network to fuse textual and visual tweet information for recommending hashtags [53,54]. Unlike the studies mentioned above, Wang et al [43] first perform Optical Character Recognition (OCR) to extract explicit optical characters from the input image and then utilize an image captioning model to extract implicit image attributes that reflect the semantic information of the image. To better integrate multi-modal information, they then introduce a multimodal multi-head attention to model the semantic interactions between different modalities.…”
Section: Introduction (mentioning)
confidence: 99%
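
The pipeline described in the statement above (OCR text and caption/attribute extraction from the image, followed by cross-modal attention over both modalities) can be illustrated with a minimal sketch. This is not the authors' M3H-Att implementation: the module name, feature dimensions, and the single text-attends-to-image layout below are assumptions for illustration, using standard PyTorch multi-head attention.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: text tokens attend to image-derived features
    (e.g. OCR tokens, caption/attribute embeddings) via multi-head attention."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, n_tokens, d_model)   encoded post text
        # image_feats: (batch, n_regions, d_model)  OCR / caption / attribute embeddings
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection + layer norm; a keyphrase decoder would consume the output.
        return self.norm(text_feats + fused)

# Toy usage with random tensors standing in for encoder outputs.
fusion = CrossModalFusion()
text = torch.randn(2, 30, 512)    # 2 posts, 30 text tokens each
image = torch.randn(2, 49, 512)   # 2 posts, 49 visual features each
out = fusion(text, image)         # -> (2, 30, 512), text enriched with visual context

A keyphrase generator or classifier would then operate on the fused text representation; the actual model in the cited paper additionally models interactions in multiple attention directions.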