Video captioning is the process of summarising the content, events and actions of a video into a short textual form, which can be helpful in many research areas such as video-guided machine translation, video sentiment analysis and providing aid to individuals in need. In this paper, a system description of the framework used for the VATEX-2020 video captioning challenge is presented. We employ an encoder-decoder based approach in which the visual features of the video are encoded using a 3D convolutional neural network (C3D). In the decoding phase, two Long Short-Term Memory (LSTM) recurrent networks are used, in which the visual features and the input captions are processed separately, and the final output is generated by performing an element-wise product between the outputs of the two LSTMs. Our model achieves BLEU scores of 0.20 and 0.22 on the public and private test sets, respectively.
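As a rough illustration of the decoder described above, the following PyTorch sketch runs one LSTM over C3D visual features and another over the caption tokens, and fuses the two streams with an element-wise product before the vocabulary projection. All dimensions, layer sizes and the fusion details are assumptions for illustration, not the exact configuration from the paper.

```python
# Hypothetical sketch of a dual-LSTM captioning decoder with element-wise fusion.
import torch
import torch.nn as nn

class DualLSTMCaptionDecoder(nn.Module):
    def __init__(self, visual_dim=4096, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, captions):
        # visual_feats: (batch, T_vis, visual_dim) C3D features
        # captions:     (batch, T_txt) token ids of the partial caption
        v_out, _ = self.visual_lstm(visual_feats)         # (batch, T_vis, hidden)
        t_out, _ = self.text_lstm(self.embed(captions))   # (batch, T_txt, hidden)
        # Broadcast the last visual state over every text step, then fuse
        # the two streams with an element-wise product.
        v_last = v_out[:, -1:, :].expand_as(t_out)
        fused = v_last * t_out
        return self.out(fused)                            # (batch, T_txt, vocab)

# Usage sketch with random inputs
model = DualLSTMCaptionDecoder()
feats = torch.randn(2, 20, 4096)           # 20 C3D feature vectors per video
caps = torch.randint(0, 10000, (2, 15))    # 15 previous caption tokens
logits = model(feats, caps)                # (2, 15, 10000)
```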
Sentiment analysis has come a long way since it was introduced as a natural language processing task nearly 20 years ago. Sentiment analysis aims to extract the underlying attitudes and opinions toward an entity. It has become a powerful tool used in government, business, medicine, marketing and other domains. Traditional sentiment analysis models focus mainly on text content. However, technological advances have allowed people to express their opinions and feelings through audio, image and video channels. As a result, sentiment analysis is shifting from unimodality to multimodality. Multimodal sentiment analysis brings new opportunities, as complementary data streams enable richer and deeper sentiment detection that goes beyond text-based analysis; in its broadest sense, it incorporates audio and video channels alongside text. Researchers have pursued different approaches to improving system performance by employing complex deep neural architectures. Recently, sentiment analysis has achieved significant success using transformer-based models. This paper presents a comprehensive study of different sentiment analysis approaches, applications, challenges and resources, and concludes that the field holds tremendous potential. The primary motivation of this survey is to highlight the changing trend from unimodality to multimodality in solving sentiment analysis tasks.
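For readers unfamiliar with the transformer-based approach mentioned above, the snippet below is a minimal, generic example using the Hugging Face `transformers` pipeline with its default English sentiment model; it is only an illustration of the technique, not a method described in the survey.

```python
# Minimal transformer-based sentiment classification example (text-only).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The visuals were stunning, but the plot dragged on."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.98}]  (exact score varies by model version)
```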
Multimodal translation is the task of translating text from a source language into a target language with the help of a parallel text corpus paired with images that represent the contextual details of the text. In this paper, we carried out an extensive comparison to evaluate the benefits of using a multimodal approach for translating English text into a low-resource language, Hindi, as part of the WAT2019 (Nakazawa et al., 2019) shared task. We carried out English-to-Hindi translation in three separate settings on both the evaluation and challenge datasets: first by using only the parallel text corpus, then through an image caption generation approach, and finally with the multimodal approach. Our experiments show a significant improvement in translation quality with the multimodal approach over the other approaches.
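One simple way to condition a sequence-to-sequence translator on an image, in the spirit of the multimodal setting above, is to add a projected image feature vector to the text encoder state used to initialise the decoder. The PyTorch sketch below does exactly that; the fusion scheme, architecture and all dimensions are assumptions for illustration and do not reproduce the system submitted to the shared task.

```python
# Hypothetical multimodal NMT sketch: image features fused into the decoder's
# initial hidden state.
import torch
import torch.nn as nn

class MultimodalNMT(nn.Module):
    def __init__(self, src_vocab=8000, tgt_vocab=8000, img_dim=2048, hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, img_feat, tgt_tokens):
        # src_tokens: (batch, T_src), img_feat: (batch, img_dim), tgt_tokens: (batch, T_tgt)
        _, h_text = self.encoder(self.src_embed(src_tokens))   # (1, batch, hidden)
        h_img = self.img_proj(img_feat).unsqueeze(0)            # (1, batch, hidden)
        h_init = h_text + h_img                                 # fuse text and image context
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), h_init)
        return self.out(dec_out)                                # (batch, T_tgt, tgt_vocab)

# Usage sketch with random inputs (teacher forcing on the target side)
model = MultimodalNMT()
logits = model(torch.randint(0, 8000, (2, 12)),   # source (e.g. English) tokens
               torch.randn(2, 2048),              # pooled image feature vector
               torch.randint(0, 8000, (2, 14)))   # target (e.g. Hindi) tokens
```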