Visual Attention Model for Name Tagging in Multimodal Social Media

Lu, Di; Neves, Lucio Pereira; Carvalho, Vitor R.; Zhang, Ning; Ji, Heng

doi:10.18653/v1/p18-1185

Cited by 187 publications

(147 citation statements)

References 29 publications

Supporting

Mentioning

124

Contrasting


“…Potential improvements include, for example, accounting for the original multi-label nature of emotion classification, or covering more than only 20 emoji in emoji prediction. There are also other scenarios to be addressed as well, like sequence tagging (Baldwin et al, 2015; Gimpel et al, 2018), multimodality (Schifanella et al, 2016; Lu et al, 2018), and codeswitching tasks (Barman et al, 2014; Vilares et al, 2016). This is similar to the evolution of GLUE (Wang et al, 2019b) into SuperGLUE (Wang et al, 2019a), with both benchmarks contributing to the development of the field in different ways.…”

Section: Discussion (mentioning)

confidence: 99%

TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification

Barbieri, Camacho-Collados, Espinosa-Anke, et al. 2020

Findings of the Association for Computational Linguistics: EMNLP 2020

Self Cite


The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. Therefore, it is unclear what the current state of the art is, as there is no standardized evaluation protocol, nor a strong set of baselines trained on such domain-specific data. In this paper, we propose a new evaluation framework (TWEETEVAL) consisting of seven heterogeneous Twitter-specific classification tasks. We also provide a strong set of baselines as starting point, and compare different language modeling pre-training strategies. Our initial experiments show the effectiveness of starting off with existing pretrained generic language models, and continue training them on Twitter corpora.


“…Attention mechanism was initially proposed in neural machine translation to dynamically adjust the focus on the source sentence (Bahdanau et al, 2014), but its application has been extended to many areas including multimodal fusion (Lu et al, 2018; Ghosal et al, 2018;. The idea of attention is to use the information of a vector (called query) to weighted-sum a list of vectors (called context).…”

Section: Attention Mechanism (mentioning)

confidence: 99%
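The query/context description in the statement above can be made concrete with a small sketch. It is illustrative only: the dot-product scoring function, the softmax normalization, and the names `softmax` and `attend` are assumptions chosen for the example, not details taken from Lu et al (2018) or from the citing paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, context):
    """Return (weights, weighted_sum) for one query over a list of context vectors.

    Scoring is a plain dot product (an assumption made for illustration);
    other scoring functions are possible.
    """
    scores = [sum(q * c for q, c in zip(query, vec)) for vec in context]
    weights = softmax(scores)
    dim = len(context[0])
    # Weighted sum of the context vectors, one output component at a time.
    summed = [sum(w * vec[i] for w, vec in zip(weights, context)) for i in range(dim)]
    return weights, summed


query = [1.0, 0.0]
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, summed = attend(query, context)
```

Here the first and third context vectors score equally against the query and receive equal weight, while the second, which is orthogonal to the query, receives less.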

“…Zhong et al (2016) also studied the combination of image and captions for the task of detecting cyberbullying. For the task of name tagging, formulated as a sequence labeling problem, Lu et al (2018) apply a visual attention model to put the focus on the sub-areas of a photo that are more relevant to the text encoded by a bi-LSTM model. For the task of image-text matching, Wang et al (2017) compare an embedding network that projects texts and photos into a joint space where semantically-similar texts and photos are close to each other, with a similarity network that fuses text embeddings and photo embeddings via element multiplication.…”

Section: Introduction (mentioning)

confidence: 99%
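The two image-text matching strategies mentioned in the statement above can be contrasted in a few lines. This is a hedged sketch, not the method of Wang et al (2017): the use of cosine similarity as the closeness measure in the joint space, the toy vectors, and the function names `cosine_similarity` and `fuse_elementwise` are all assumptions for illustration, and the learned projection layers are omitted.

```python
import math


def cosine_similarity(a, b):
    """Closeness of two vectors in a shared space (assumed measure: cosine)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def fuse_elementwise(text_vec, photo_vec):
    """Fuse two same-length embeddings by element-wise multiplication."""
    if len(text_vec) != len(photo_vec):
        raise ValueError("embeddings must have the same dimension")
    return [t * p for t, p in zip(text_vec, photo_vec)]


# Toy embeddings standing in for the outputs of text and photo encoders.
text_vec = [0.5, 1.0, -2.0]
photo_vec = [2.0, 1.0, 0.5]

closeness = cosine_similarity(text_vec, photo_vec)  # embedding-network view
fused = fuse_elementwise(text_vec, photo_vec)       # similarity-network view
```

In the embedding-network view the match score is read directly from the distance between the two vectors; in the similarity-network view the fused vector would typically be passed to further layers that predict the match.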

Exploring Deep Multimodal Fusion of Text and Photo for Hate Speech Classification

Yang, Xu, Ghosh, et al. 2019

Proceedings of the Third Workshop on Abusive Language Online


Interactions among users on social network platforms are usually positive, constructive and insightful. However, sometimes people also get exposed to objectionable content such as hate speech, bullying, and verbal abuse etc. Most social platforms have explicit policy against hate speech because it creates an environment of intimidation and exclusion, and in some cases may promote real-world violence. As users' interactions on today's social networks involve multiple modalities, such as texts, images and videos, in this paper we explore the challenge of automatically identifying hate speech with deep multimodal technologies, extending previous research which mostly focuses on the text signal alone. We present a number of fusion approaches to integrate text and photo signals. We show that augmenting text with image embedding information immediately leads to a boost in performance, while applying additional attention fusion methods brings further improvement.


“…More recently, authors of [41] place an attention layer on top of several modality-specific feature encoding layers to model the importance of different modalities in book genre prediction. There are many other works [20,35,39,40] that leverage this technique, i.e. encoding sequential/temporal data for each modality before computing attention weighting and fusing encoded modality-specific features.…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal Data Fusion based on the Global Workspace Theory

Bao, Fountas, Olugbade, et al. 2020

Proceedings of the 2020 International Conference on Multimodal Interaction


We propose a novel neural network architecture, named the Global Workspace Network (GWN), which addresses the challenge of dynamic and unspecified uncertainties in multimodal data fusion. Our GWN is a model of attention across modalities and evolving through time, and is inspired by the well-established Global Workspace Theory from the field of cognitive science. The GWN achieved an average F1 score of 0.92 for discrimination between pain patients and healthy participants and an average F1 score of 0.75 for further classification of three pain levels for a patient, both based on the multimodal EmoPain dataset captured from people with chronic pain and healthy people performing different types of exercise movements in unconstrained settings. In these tasks, the GWN significantly outperforms the typical fusion approach of merging by concatenation. We further provide extensive analysis of the behaviour of the GWN and its ability to address uncertainties (hidden noise) in multimodal data.
