Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua 2021
DOI: 10.18653/v1/2021.naacl-main.473

MM-AVS: A Full-Scale Dataset for Multi-modal Summarization

Abstract: Multimodal summarization becomes increasingly significant as it is the basis for question answering, Web search, and many other downstream tasks. However, its learning materials have been lacking a holistic organization by integrating resources from various modalities, thereby lagging behind the research progress of this field. In this study, we present a full-scale multimodal dataset comprehensively gathering documents, summaries, images, captions, videos, audios, transcripts, and titles in English from CNN a…

Cited by 12 publications (16 citation statements)
References 13 publications
“…For VMSMO dataset, the quality of chosen cover frame is evaluated by mean average precision (MAP) and recall at position (R_n@k) [81,108], where (R_n@k) measures if the positive sample is ranked in the top k positions of n candidates. For Daily Mail dataset and CNN dataset, we calculate the cosine image similarity (Cos) between image references and the extracted frames from videos [22,23]. We compare our MHMS model with existing multimodal summarization, video summarization, and textual summarization approaches.…”
Section: Experiments and Results (mentioning)
confidence: 99%
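
The two measures named in the statement above can be illustrated with a minimal Python sketch. It rests on assumptions the quoted passage does not confirm: the function names `recall_at_k` and `cosine_similarity` are hypothetical, candidates are assumed to carry a relevance score where higher means better, and images are assumed to be already encoded as plain feature vectors (the citing paper's feature extractor is not specified here).

```python
import math


def recall_at_k(scores, positive_index, k):
    """R_n@k for one example: 1.0 if the positive candidate is ranked
    in the top k of the n scored candidates, otherwise 0.0.

    Assumption: a higher score means a better candidate.
    """
    # Rank candidate indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if positive_index in ranked[:k] else 0.0


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Averaging `recall_at_k` over all evaluation examples would give a dataset-level figure; that aggregation step is likewise an assumption, not something the quoted passage states.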
“…We evaluated our models on three datasets: VMSMO dataset [57], Daily Mail dataset, and CNN dataset from [22,23,57]. The popular COIN and Howto100M can not be used in our task, since they lack narrations and key-step annotation [54,80].…”
Section: Datasets and Baselines 41 Datasets (mentioning)
confidence: 99%