MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Soldan, Mattia; Pardo, Alejandro; Alcázar, Juan León; Heilbron, Fabian Caba; Zhao, Chen; Giancola, Silvio; Ghanem, Bernard

doi:10.1109/cvpr52688.2022.00497

Cited by 40 publications

(20 citation statements)

References 35 publications


“…There are two unique challenges posed in our TeViS task. First, the text synopses are diverse covering a wide range of topics and some of them are also high-level and abstract, e.g., 2.76 concreteness score on average (vs. 2.99 in other video-text datasets such as LSMDC [32] and MAD [36]). Therefore, it is much more difficult to visualize texts with relevant images.…”

Section: Cinematic Coherent Transitions Across Keyframes

mentioning

confidence: 99%


Translating Text Synopses to Video Storyboards

Gu, Sun, Chen et al. 2023

Preprint


A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards however remains challenging which not only requires association between high-level texts and images, but also demands for long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset [15]. It contains 10K text synopses each paired with keyframes that are manually selected from corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models to create text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work.


“…Condensed Movie Dataset (CMD) [1] consists of key scenes from the movie, each of which is accompanied by a high-level semantic description of the scene. The MAD [36] dataset is based on the LSMDC [32] dataset.…”

Section: Movie Understanding

mentioning

confidence: 99%

Translating Text Synopses to Video Storyboards

Gu, Sun, Chen et al. 2023

Preprint

“…Video temporal grounding [5,6,13,18,19,29,31,32], which aims to localize a specific moment in the video corresponding to a natural language description, has found its applications in many real-world scenarios, such as video retrieval [11,22], video highlight detection [23,24], and video question answering [9,26].…”

Section: Introduction

mentioning

confidence: 99%

“…Detailed discussion can be found in Sec. 4.5. to the case of long-form video temporal grounding (LVTG) [6,18], however, temporally downsampling a video (e.g., in hours) to so few frames could cause severe information loss and further result in drastic performance degradation [6].…”

Section: Introduction

mentioning

confidence: 99%

Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

Pan, He, Gong et al. 2023

Preprint


Video temporal grounding aims to pinpoint a video segment that matches the query description. Despite the recent advance in short-form videos (e.g., in minutes), temporal grounding in long videos (e.g., in hours) is still at its early stage. To address this challenge, a common practice is to employ a sliding window, yet can be inefficient and inflexible due to the limited number of frames within the window. In this work, we propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution. Our pipeline is formulated in a coarse-to-fine manner, where we first extract context knowledge from non-overlapped video clips (i.e., anchors), and then supplement the anchors that highly response to the query with detailed content knowledge. Besides the remarkably high pipeline efficiency, another advantage of our approach is the capability of capturing long-range temporal correlation, thanks to modeling the entire video as a whole, and hence facilitates more accurate grounding. Experimental results suggest that, on the long-form video datasets MAD and Ego4d, our method significantly outperforms state-of-the-arts, and achieves 14.6× / 102.8× higher efficiency respectively. Project can be found at https://github.com/afcedf/SOONet.git.
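The coarse stage described in this abstract, splitting a long video into non-overlapped anchors and keeping those that respond most strongly to the query, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_anchors` and `top_k_anchors` are hypothetical, and the per-anchor scores are assumed to come from some query-anchor similarity model that is not shown here.

```python
from typing import List, Tuple


def split_into_anchors(num_frames: int, anchor_len: int) -> List[Tuple[int, int]]:
    """Split the frame range [0, num_frames) into non-overlapping
    half-open anchors [start, end). The last anchor may be shorter."""
    if anchor_len <= 0:
        raise ValueError("anchor_len must be positive")
    return [
        (start, min(start + anchor_len, num_frames))
        for start in range(0, num_frames, anchor_len)
    ]


def top_k_anchors(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring anchors, best first.
    The scores are assumed to be query-anchor relevance values
    produced elsewhere (hypothetical input)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Usage: a 10-frame video with anchors of 4 frames, then keep the 2 best anchors.
anchors = split_into_anchors(10, 4)
best = top_k_anchors([0.1, 0.9, 0.5], 2)
selected = [anchors[i] for i in best]
```

Because the anchors do not overlap, each frame is visited once in the coarse pass, whereas an overlapping sliding window revisits frames; only the selected anchors would then be examined in detail.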


“…Nowadays, millions of videos are produced every day, and high demand arises for automatic video processing and analysis. To this end, various tasks have emerged, for example, action recognition [19], active speaker detection [2], video-language grounding [41], temporal action localization [26,42]. Among those tasks, temporal action detection in untrimmed videos, in particular, is one of the fundamental yet challenging tasks.…”

mentioning

confidence: 99%

SegTAD: Precise Temporal Action Detection via Semantic Segmentation

Zhao, Ramazanova et al. 2023

Lecture Notes in Computer Science

Self Cite


Temporal action detection (TAD) is an important yet challenging task in video analysis. Most existing works draw inspiration from image object detection and tend to reformulate it as a proposal generation-classification problem. However, there are two caveats with this paradigm. First, proposals are not equipped with annotated labels, which have to be empirically compiled, thus the information in the annotations is not necessarily precisely employed in the model training process. Second, there are large variations in the temporal scale of actions, and neglecting this fact may lead to deficient representation in the video features. To address these issues and precisely model TAD, we formulate the task in a novel perspective of semantic segmentation. Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free. We take advantage of them to provide precise supervision so as to mitigate the impact induced by the imprecise proposal labels. We propose a unified framework SegTAD composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN). We evaluate SegTAD on two important large-scale datasets for action detection and it shows competitive performance on both datasets.
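The conversion from detection annotations to segmentation annotations mentioned in this abstract can be sketched as follows. This is an illustrative assumption about what such a conversion looks like, not the SegTAD code: the function name `segments_to_frame_labels`, the half-open `(start, end, class_id)` segment format, and the background label `0` are all hypothetical choices.

```python
from typing import List, Tuple


def segments_to_frame_labels(
    segments: List[Tuple[int, int, int]],
    num_frames: int,
    background: int = 0,
) -> List[int]:
    """Turn action segments into one class label per frame.

    Each segment is (start, end, class_id) with a half-open frame
    range [start, end). Frames not covered by any segment keep the
    background label. If segments overlap, later ones overwrite
    earlier ones (a simplification made for this sketch)."""
    labels = [background] * num_frames
    for start, end, class_id in segments:
        # Clip the segment to the valid frame range.
        lo = max(0, start)
        hi = min(num_frames, end)
        for t in range(lo, hi):
            labels[t] = class_id
    return labels


# Usage: two actions in a 10-frame video.
frame_labels = segments_to_frame_labels([(2, 5, 1), (7, 9, 2)], 10)
```

Since the timeline is one-dimensional, every annotated segment maps directly onto a run of per-frame labels, which is why no extra annotation effort is needed for this step.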
