Video Object Grounding Using Semantic Roles in Language Description

Sadhu, Arka; Chen, Kan; Nevatia, Ram

doi:10.1109/cvpr42600.2020.01043

Cited by 43 publications

(30 citation statements)

References 39 publications

“…SRL in Vision: has been explored in the context of human object interaction (Gupta and Malik, 2015), situation recognition (Yatskar et al, 2016), and multi-media extraction (Li et al, 2020). Most related to ours is the usage of SRLs for grounding (Silberer and Pinkal, 2018) in images and videos (Sadhu et al, 2020). Our work builds on (Sadhu et al, 2020) in using SRLs on video descriptions, however, our focus is not on grounding.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most related to ours is the usage of SRLs for grounding (Silberer and Pinkal, 2018) in images and videos (Sadhu et al, 2020). Our work builds on (Sadhu et al, 2020) in using SRLs on video descriptions, however, our focus is not on grounding. Instead, we use SRLs primarily as a query generation tool and use the argument as a question directive.…”

Section: Related Work (mentioning)

confidence: 99%

“…We simulate the balancing process using the contrastive sampling method used in (Sadhu et al, 2020). Specifically, for a given video-query-answer (V1, Q1, A1) tuple we retrieve another video-query-answer (V2, Q2, A2) tuple which share the same semantic role structure as well as lemmatized noun and verbs for the question, but a different lemmatized noun for the answer.…”

Section: Evaluating Answer Phrases (mentioning)

confidence: 99%
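
The retrieval rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `QASample` record, its field names, and the `find_contrastive` helper are hypothetical, and skipping candidates from the same video is an added assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class QASample:
    """Hypothetical record for one video-query-answer tuple."""
    video_id: str
    role_structure: Tuple[str, ...]   # semantic role layout, e.g. ("ARG0", "V", "ARG1")
    question_lemmas: FrozenSet[str]   # lemmatized nouns and verbs in the question
    answer_lemma: str                 # lemmatized noun of the answer phrase


def find_contrastive(anchor: QASample, pool: List[QASample]) -> Optional[QASample]:
    """Return the first sample with the same role structure and question lemmas
    as `anchor` but a different answer lemma, or None if there is none."""
    for cand in pool:
        if cand.video_id == anchor.video_id:
            continue  # assumption: the contrastive sample comes from another video
        if (cand.role_structure == anchor.role_structure
                and cand.question_lemmas == anchor.question_lemmas
                and cand.answer_lemma != anchor.answer_lemma):
            return cand
    return None
```

A linear scan keeps the sketch short; on a full dataset one would more likely index samples by `(role_structure, question_lemmas)` so each lookup avoids scanning the whole pool.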

“…To mitigate the language-bias issue, we emulate the procedure proposed by (Goyal et al, 2017) where for a given question, another image (or video in our case) is retrieved which has a different answer for the same question. To retrieve such a video, we use a contrastive sampling method (Sadhu et al, 2020) over the dataset by comparing only the lemmatized nouns and verbs within the semantic roles (SRLs). We then propose contrastive scoring to combine the scores of the two answer phrases obtained from the contrastive samples (details on evaluation in Section 3.2).…”

Section: Introduction (mentioning)

confidence: 99%

“…To investigate VidQAP, we extend three vision-language models namely, Bottom-Up-Top-Down (Anderson et al, 2018), VOGNet (Sadhu et al, 2020) and a Multi-Modal Transformer by replacing their classification heads with a Transformer (Vaswani et al, 2017) based language decoder. To facilitate research on VidQAP we construct two datasets ActivityNet-SRL-QA (ASRL-QA) and Charades-SRL-QA and provide a thorough analysis of extended models to serve as a benchmark for future research (details on model framework in Section 3.3 and dataset creation in Section 4.1).…”

Section: Introduction (mentioning)

confidence: 99%

Video Question Answering with Phrases via Semantic Roles

Sadhu; Chen; Nevatia

2021

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua

Self Cite

Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models' application scenario. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases, to introduce VidQAP which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer compared to an empty string. To reduce the influence of language-bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We perform extensive analysis and ablative studies to guide future work. Code and data are public.
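
One way to read "relative improvement of the predicted answer compared to an empty string" is sketched below. The abstract does not give the formula, so everything here is an assumption: the normalization against the score of the reference itself, the `<Q>` placeholder, the `relative_score` helper, and the toy `token_f1` metric, which stands in for whatever sentence-level metric is actually used.

```python
from collections import Counter
from typing import Callable


def token_f1(reference: str, hypothesis: str) -> float:
    """Toy bag-of-tokens F1, used here only as a stand-in metric."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 0.0
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_score(template: str, reference_phrase: str, predicted_phrase: str,
                   metric: Callable[[str, str], float] = token_f1,
                   blank: str = "<Q>") -> float:
    """Score a predicted phrase relative to filling the blank with nothing.

    Assumed form: (pred - base) / (best - base), where `base` is the score of
    the empty-string fill and `best` is the score of the reference fill.
    """
    def fill(phrase: str) -> str:
        # Insert the phrase and collapse any doubled whitespace.
        return " ".join(template.replace(blank, phrase).split())

    ref_sentence = fill(reference_phrase)
    base = metric(ref_sentence, fill(""))
    best = metric(ref_sentence, ref_sentence)
    pred = metric(ref_sentence, fill(predicted_phrase))
    if best == base:
        return 0.0  # the blank carries no measurable signal under this metric
    return (pred - base) / (best - base)
```

Under this reading, an empty prediction scores 0 and the reference phrase scores 1, so the words shared through the unmasked part of the sentence do not inflate the result.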
