2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.00497

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Abstract: The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions)…

Cited by 40 publications (20 citation statements)
References 35 publications
“…There are two unique challenges posed in our TeViS task. First, the text synopses are diverse covering a wide range of topics and some of them are also high-level and abstract, e.g., 2.76 concreteness score on average (vs. 2.99 in other video-text datasets such as LSMDC [32] and MAD [36]). Therefore, it is much more difficult to visualize texts with relevant images.…”
Section: Cinematic Coherent Transitions Across Keyframes
mentioning
confidence: 99%
“…Condensed Movie Dataset (CMD) [1] consists of key scenes from the movie, each of which is accompanied by a high-level semantic description of the scene. The MAD [36] dataset is based on the LSMDC [32] dataset.…”
Section: Movie Understanding
mentioning
confidence: 99%
“…Video temporal grounding [5,6,13,18,19,29,31,32], which aims to localize a specific moment in the video corresponding to a natural language description, has found its applications in many real-world scenarios, such as video retrieval [11,22], video highlight detection [23,24], and video question answering [9,26].…”
Section: Introduction
mentioning
confidence: 99%
“…Detailed discussion can be found in Sec. 4.5. to the case of long-form video temporal grounding (LVTG) [6,18], however, temporally downsampling a video (e.g., in hours) to so few frames could cause severe information loss and further result in drastic performance degradation [6].…”
Section: Introduction
mentioning
confidence: 99%
“…Nowadays, millions of videos are produced every day, and high demand arises for automatic video processing and analysis. To this end, various tasks have emerged, for example, action recognition [19], active speaker detection [2], video-language grounding [41], temporal action localization [26,42]. Among those tasks, temporal action detection in untrimmed videos, in particular, is one of the fundamental yet challenging tasks.…”
mentioning
confidence: 99%