“…According to the requirements of parallel training samples, existing solutions can be divided into two types: those that use parallel stylized image-caption data [41,11,54,1] and those that do not [22,42]. Subsequently, the community gradually shifted its emphasis to controlling the described contents [16,77,27,10,78,48,35] or structures [20,19,75,76,18,60,37,36,64], aiming to generate discriminative and unique captions for individual images. Unfortunately, due to the subjective nature of diverse and distinctive captions, effective evaluation remains an open problem, and several new metrics have been proposed, such as SPICE-U [67], CIDErBtw [64], self-CIDEr [66], word recall [58], and mBLEU [52].…”