NITS-VC System for VATEX Video Captioning Challenge 2020

Singh, Rajesh; Singh, Thoudam Doren; Bandyopadhyay, Sivaji

doi:10.48550/arxiv.2006.04058

Cited by 4 publications

(5 citation statements)

References 15 publications


“…Other approaches such as [71], [43], and [108] participated in the VATEX video captioning challenge 2020 and report their results. The multi-features and hybrid reward strategy approach proposed in [108] was the winner of the video captioning competition and reports the highest result on the VATEX dataset.…”

Section: Results Of State-of-the-art Approaches (mentioning)

confidence: 99%


A Comprehensive Review on Recent Methods and Challenges of Video Description

Singh, Singh, Bandyopadhyay

2020

Preprint

Self Cite


Video description involves the generation of a natural language description of the actions, events, and objects in a video. Video description has various applications that bridge the gap between language and vision: assisting visually impaired people, generating automatic title suggestions based on content, content-based video browsing, video-guided machine translation [86], etc. In the past decade, several works have been done in this field in terms of approaches/methods for video description, evaluation metrics, and datasets. For analyzing the progress in the video description task, a comprehensive survey is needed that covers all the phases of video description approaches with a special focus on recent deep learning approaches. In this work, we report a comprehensive survey on the phases of video description approaches, the datasets for video description, evaluation metrics, open competitions for motivating research on video description, open challenges in this field, and future research directions. In this survey, we cover the state-of-the-art approaches proposed for each dataset with their pros and cons. For the growth of this research domain, the availability of numerous benchmark datasets is a basic need. Further, we categorize all the datasets into two classes: open-domain datasets and domain-specific datasets. A brief discussion of the pros and cons of automatic evaluation metrics and human evaluation is also included in this survey. From our survey, we observe that work in this field is developing at a fast pace, since the task of video description falls at the intersection of computer vision and natural language processing. Still, work on video description is far from the saturation stage due to various challenges, such as the redundancy among similar frames, which affects the quality of visual features, the need for datasets containing more diverse content, and the need for an effective evaluation metric.



“…al. [71] proposed a video captioning framework in the VATEX challenge using two parallel LSTMs. The way of fusing visual representation with an embedded representation of the reference caption is different for both LSTM.…”

Section: Recent (mentioning)

confidence: 99%

A Comprehensive Review on Recent Methods and Challenges of Video Description

Singh, Singh, Bandyopadhyay

2020

Preprint

Self Cite


“…More formally, we denote the task goal as T, input video demonstration as V, and the target text script as S = {S_1, ..., S_n} involving n necessary and ordered steps. Compared to action anticipation (Girdhar and Grauman 2021; Zhong et al. 2022) or video captioning (Singh, Singh, and Bandyopadhyay 2020), the generated scripts in our task are expected to be well-structured descriptions for a sequence of actions that follow a temporal and logical order.…”

Section: Dataset Design Task Formulation (mentioning)

confidence: 99%

MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

Qi, Liu, Shen et al.

2024

AAAI


Automatically generating scripts (i.e. sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial to the modern AI virtual assistants to guide humans to complete everyday tasks, especially unfamiliar ones. However, current methods for generative script learning rely heavily on well-structured preceding steps described in text and/or images or are limited to a certain domain, resulting in a disparity with real-world user scenarios. To address these limitations, we present a new benchmark challenge – MULTISCRIPT, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation, and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task, and the expected output is (1) a sequence of structured step descriptions in text based on the demonstration video, and (2) a single text description for the subsequent step, respectively. Built from WikiHow, MULTISCRIPT covers multimodal scripts in videos and text descriptions for over 6,655 human everyday tasks across 19 diverse domains. To establish baseline performance on MULTISCRIPT, we propose two knowledge-guided multimodal generative frameworks that incorporate the task-related knowledge prompted from large language models such as Vicuna. Experimental results show that our proposed approaches significantly improve over the competitive baselines.


“…Various other video description datasets depicting everyday activities have been presented [3,6,32,52]. In this work, we mainly focus on the VATEX Captioning dataset [27], which has also been used in the Video-to-Text (VTT) task [17,27,42,56,57,58]. Furthermore, we validate our models on the MSR-VTT [52] and MSVD [6] datasets.…”

Section: Related Work (mentioning)

confidence: 99%

Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

Harzig, Einfalt, Lienhart

2021

Preprint


Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand scenes of a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, but lack a straightforward and reproducible application for VTT. Moreover, there is no comprehensive study on different strategies and advice for video description generation, including exploiting the accompanying audio with fully self-attentive networks. Thus, we explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset to determine a configuration applicable to unseen datasets that helps describe short video clips in natural language. This configuration improves the CIDEr and BLEU-4 scores by 37.13 and 12.83 points compared to a vanilla Transformer network and achieves state-of-the-art results on the MSR-VTT and MSVD datasets. Also, FPE helps increase the CIDEr score by a relative factor of 8.6 %.

