Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Venugopalan, Subhashini; Xu, Huaxi; Donahue, Jeff; Rohrbach, Marcus; Mooney, Raymond J.; Saenko, Kate

doi:10.48550/arxiv.1412.4729

Cited by 103 publications

(163 citation statements)

References 0 publications

Supporting

Mentioning: 163

Contrasting


“…Video Question Answering. In Table 3, zero-shot VideoCLIP outperforms most supervised

DiDeMo dataset: R@1 ↑, R@5
SUPERVISED
S2VT (Venugopalan et al., 2014): 11.9, 33.6
FSE (Zhang et al., 2018): 13.9, 44.5
CE (Liu et al., 2019a): 16.1, 41.1
ClipBERT: 20.4, 48.0
ZERO-SHOT
VideoCLIP (Zero-shot)…”

Section: Results (mentioning)

confidence: 99%

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Ghosh, Huang, Okhonko et al. 2021

Preprint


We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.


“…5. Specifically, one of the early works [104], which is only applicable for videos of short duration, employs mean-pooling to frame representations extracted by a shared CNN and utilizes an LSTM architecture for caption generation. To extend the validity of extracted features to longer durations, recurrent visual encoder architectures are used [94,105,106].…”

Section: Image and Video Captioning (mentioning)

confidence: 99%

Towards Goal-Oriented Semantic Signal Processing: Applications and Future Challenges

Kalfa, Gok, Atalik et al. 2021

Preprint


Advances in machine learning technology have enabled real-time extraction of semantic information in signals which can revolutionize signal processing techniques and improve their performance significantly for the next generation of applications. With the objective of a concrete representation and efficient processing of the semantic information, we propose and demonstrate a formal graph-based semantic language and a goal filtering method that enables goal-oriented signal processing. The proposed semantic signal processing framework can easily be tailored for specific applications and goals in a diverse range of signal processing applications. To illustrate its wide range of applicability, we investigate several use cases and provide details on how the proposed goal-oriented semantic signal processing framework can be customized. We also investigate and propose techniques for communications where sensor data is semantically processed and semantic information is exchanged across a sensor network.


“…[35,36] tailor better recurrent layers that are easy to stack deep for higher-dimensional video information. These recurrent-based methods have advantages over convolutional ones for tasks sensitive to sequence order, such as video future prediction [32,56,67], trajectory prediction [47], and video description [4,55]. While for tasks that more focus on integrated features like action recognition [65,25,37,2,6,24,23], there is still a gap between the recurrent and convolutional models.…”

Section: Related Work (mentioning)

confidence: 99%

PGT: A Progressive Method for Training Models on Long Videos

Pang, Gao, Li et al. 2021

Preprint


Convolutional video models have an order of magnitude larger computational complexity than their counterpart image-level models. Constrained by computational resources, there is no model or training method that can train long video sequences end-to-end. Currently, the mainstream method is to split a raw video into clips, leading to incomplete fragmentary temporal information flow. Inspired by natural language processing techniques dealing with long sentences, we propose to treat videos as serial fragments satisfying Markov property, and train it as a whole by progressively propagating information through the temporal dimension in multiple steps. This progressive training (PGT) method is able to train long videos end-to-end with limited resources and ensures the effective transmission of information. As a general and robust training method, we empirically demonstrate that it yields significant performance improvements on different models and datasets. As an illustrative example, the proposed method improves SlowOnly network by 3.7 mAP on Charades and 1.9 top-1 accuracy on Kinetics with negligible parameter and computation overhead. Code is available at: https://github.com/BoPang1996/PGT.
