Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video

Li, Haoran; Zhu, Junnan; Ma, Cong; Zhang, Jiajun; Zong, Chengqing

doi:10.18653/v1/d17-1114

Cited by 73 publications

(66 citation statements)

References 32 publications


“…Erol et al [54] proposed a method for detecting important segments of a recorded meeting based on activity analysis, which simply measured audio amplitude and luminance difference between two video frames as well as text analysis using tf-idf. More recently, Li et al [55] proposed an extractive multi-modal summarization method that selects salient sentences by considering the images, audio, and videos related to a specific topic. However, they did not address the issue of meeting summarization.…”

Section: Text Speech and Meeting Summarization (mentioning)

confidence: 99%

Exploring Methods for Predicting Important Utterances Contributing to Meeting Summarization

Nihei

Nakano

2019

MTI


Meeting minutes are useful, but creating meeting summaries is a time-consuming task. To support this task, this paper proposes prediction models for important utterances that should be included in the meeting summary, using multimodal and multiparty features. We tackle this issue with two approaches: handcrafted feature models and deep neural network models. The best handcrafted feature model achieved an F-measure of 0.707, and the best deep-learning-based verbal and nonverbal model (V-NV model) achieved an F-measure of 0.827. Based on the V-NV model, we implemented a meeting browser and conducted a user study. The results showed that the proposed meeting browser contributes more to understanding the content of the discussion and the participants' roles in it than a conventional text-based browser.


“…Multimodal summarization has been proposed to extract the most important information from the multimedia information. The most significant difference between multimodal summarization (Mademlis et al 2016;Li et al 2017;2018b;Zhu et al 2018) and text summarization (Zhu et al 2017;Paulus, Xiong, and Socher 2018;Celikyilmaz et al 2018;Li et al 2018c;Zhu et al 2019) lies in whether the input data contains two or more modalities of data. One of the most significant advantages of the task is that it can use the rich information in multimedia data to improve the quality of the final summary.…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal Summarization with Guidance of Multimodal Reference

Zhu

Zhou

Zhang

et al. 2020

AAAI

Self Cite


Multimodal summarization with multimodal output (MSMO) is to generate a multimodal summary for a multimodal news report, which has been proven to effectively improve users' satisfaction. The existing MSMO methods are trained by the target of text modality, leading to the modality-bias problem that ignores the quality of model-selected image during training. To alleviate this problem, we propose a multimodal objective function with the guidance of multimodal reference to use the loss from the summary generation and the image selection. Due to the lack of multimodal reference data, we present two strategies, i.e., ROUGE-ranking and Order-ranking, to construct the multimodal reference by extending the text reference. Meanwhile, to better evaluate multimodal outputs, we propose a novel evaluation metric based on joint multimodal representation, projecting the model output and multimodal reference into a joint semantic space during evaluation. Experimental results have shown that our proposed model achieves the new state-of-the-art on both automatic and manual evaluation metrics. Besides, our proposed evaluation method can effectively improve the correlation with human judgments.


“…A few deep learning frameworks [2,11,31] show promising results, too. Li et al [12] uses an asynchronous dataset containing text, images and videos to generate a textual summary. Although some work on document summarization has been done using ILP, to the best of our knowledge no one has ever used an ILP framework in the area of multi-modal summarization.…”

Section: Related Work (mentioning)

confidence: 99%

“…There is no benchmark dataset for the TIVS task. Therefore, we created our own text-image-video dataset by extending and manually annotating the multi-modal summarization dataset introduced by Li et al [12]. Their dataset comprised of 25 new topics.…”

Section: Dataset Preparation (mentioning)

confidence: 99%

“…To sum up, we make the following contributions: (1) We present a novel multimodal summarization task which takes news with images and videos as input, and outputs text, images and video as summary. (2) We create an extension of the multi-modal summarization dataset [12] by constructing multi-modal references containing text, images and video for each topic. (3) We design a joint ILP framework to address the proposed multi-modal summarization task.…”

Section: Introduction (mentioning)

confidence: 99%


Text-Image-Video Summary Generation Using Joint Integer Linear Programming

Jangra

Jatowt

Hasanuzzaman

et al. 2020

Lecture Notes in Computer Science


Automatically generating a summary for asynchronous data can help users to keep up with the rapid growth of multi-modal information on the Internet. However, the current multi-modal systems usually generate summaries composed of text and images. In this paper, we propose a novel research problem of text-image-video summary generation (TIVS). We first develop a multi-modal dataset containing text documents, images and videos. We then propose a novel joint integer linear programming multi-modal summarization (JILP-MMS) framework. We report the performance of our model on the developed dataset.
