A Survey on Temporal Sentence Grounding in Videos

Lan, Xiaohan; Yuan, Yitian; Wang, Xin; Wang, Zhi; Zhu, Wenwu

doi:10.48550/arxiv.2109.08039

Cited by 2 publications

(2 citation statements)

References 79 publications

(153 reference statements)

“…Figure 1 shows that the current datasets comprise relatively short videos, containing single structured scenes, and language descriptions that cover most of the video. Furthermore, the temporal anchors for the language are temporally biased, leading to methods not learning from any visual features and eventually overfitting to temporal priors for specific actions, thus limiting their generalization capabilities [9,18].…”

Section: Introduction (mentioning)

confidence: 99%

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Soldan, Pardo, Alcázar et al. 2022

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours. We have released MAD's data and baselines code at https://github.com/Soldelli/MAD.

“…Figure 1 shows that the current datasets comprise relatively short videos, contain single structured scenes, and language descriptions that cover most of the video. Furthermore, the temporal anchors for the language are temporally biased in time (refer to Figure 3), leading to methods not learning from any visual features and eventually overfitting to temporal priors for specific actions, thus limiting their generalization capabilities [7,16].…”

Section: Introduction (mentioning)

confidence: 99%

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Soldan, Pardo, Alcázar et al. 2021

Preprint

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours.
