Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1114

Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video

Abstract: The rapid increase in multimedia data transmission over the Internet necessitates multi-modal summarization (MMS) from collections of text, image, audio and video. In this work, we propose an extractive multi-modal summarization method that can automatically generate a textual summary given a set of documents, images, audios and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal content. For audio information, we design an approach to selectively use its tra…

Cited by 73 publications (66 citation statements)
References 32 publications
“…Erol et al. [54] proposed a method for detecting important segments of a recorded meeting based on activity analysis, combining simple measures of audio amplitude and luminance differences between video frames with text analysis using tf-idf. More recently, Li et al. [55] proposed an extractive multi-modal summarization method that selects salient sentences by considering the images, audio, and videos related to a specific topic. However, they did not address the issue of meeting summarization.…”
Section: Text, Speech and Meeting Summarization
Mentioning confidence: 99%
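The tf-idf weighting mentioned in the excerpt above can be sketched in a few lines. This is a minimal illustration of the standard formula (term frequency times log inverse document frequency), not the cited authors' implementation; the toy corpus and whitespace-free token lists are assumptions for the example.

```python
# Minimal tf-idf sketch: weights terms higher when they are frequent in a
# document but rare across the corpus. The toy corpus below is illustrative.
import math
from collections import Counter

def tfidf_scores(docs):
    """Return a per-document dict of tf-idf weights for lists of tokens."""
    n = len(docs)
    # document frequency: number of documents each term appears in
    df = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

docs = [["meeting", "summary", "audio"],
        ["audio", "amplitude", "frames"],
        ["summary", "salient", "sentences"]]
weights = tfidf_scores(docs)
```

A term unique to one document ("amplitude") receives a higher weight than one shared across documents ("audio"), which is the property summarizers exploit when scoring sentence salience.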
“…Multimodal summarization has been proposed to extract the most important information from multimedia information. The most significant difference between multimodal summarization (Mademlis et al. 2016; Li et al. 2017; 2018b; Zhu et al. 2018) and text summarization (Zhu et al. 2017; Paulus, Xiong, and Socher 2018; Celikyilmaz et al. 2018; Li et al. 2018c; Zhu et al. 2019) lies in whether the input contains two or more modalities of data. One of the most significant advantages of the task is that it can use the rich information in multimedia data to improve the quality of the final summary.…”
Section: Related Work
Mentioning confidence: 99%
“…A few deep learning frameworks [2,11,31] show promising results, too. Li et al. [12] use an asynchronous dataset containing text, images and videos to generate a textual summary. Although some work on document summarization has been done using ILP, to the best of our knowledge no one has ever used an ILP framework in the area of multi-modal summarization.…”
Section: Related Work
Mentioning confidence: 99%
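The ILP framing mentioned in the excerpt above casts extractive summarization as choosing a 0/1 indicator per sentence to maximize total salience under a length budget. The sketch below illustrates that objective on a toy instance by exhaustive search; a real system would hand the same constraints to an ILP solver, and the salience scores, lengths, and budget here are made-up assumptions.

```python
# ILP-style extractive selection, illustrated by brute force on a toy
# instance: maximize summed salience subject to a word-budget constraint.
# A production system would encode the same 0/1 program for an ILP solver.
from itertools import combinations

def best_subset(scores, lengths, budget):
    """Return the index set maximizing salience within the length budget."""
    n = len(scores)
    best, best_val = (), 0.0
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            if sum(lengths[i] for i in subset) <= budget:
                val = sum(scores[i] for i in subset)
                if val > best_val:
                    best, best_val = subset, val
    return best, best_val

scores = [3.0, 2.0, 2.5, 1.0]   # hypothetical salience per sentence
lengths = [10, 6, 8, 4]          # words per sentence
selected, value = best_subset(scores, lengths, budget=15)
```

Note the knapsack-like trade-off: the single highest-scoring sentence is not always in the optimum, since two shorter sentences may jointly fit the budget with a higher combined score.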
“…There is no benchmark dataset for the TIVS task. Therefore, we created our own text-image-video dataset by extending and manually annotating the multi-modal summarization dataset introduced by Li et al. [12]. Their dataset comprised 25 news topics.…”
Section: Dataset Preparation
Mentioning confidence: 99%