Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413614
Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos

Abstract: In this paper, we study the problem of weakly-supervised spatio-temporal grounding from raw untrimmed video streams. Given a video and its descriptive sentence, spatio-temporal grounding aims at predicting the temporal occurrence and spatial locations of each query object across frames. Our goal is to learn a grounding model in a weakly-supervised fashion, without the supervision of either spatial bounding boxes or temporal occurrences during training. Existing methods have addressed trimmed videos, but …

Cited by 15 publications (9 citation statements) | References 30 publications
“…Recently, Yang et al. [41] simultaneously consider the spatial and temporal contextual similarities of regions and frames in an end-to-end manner. Meanwhile, Chen et al. [2] choose to enhance the textual representations of objects by exploiting the activities described in the sentence. Since existing methods have not fully exploited the potential of the description sentences for vision-language alignment, in terms of rich contextual information and stable learning, we propose a novel frame-level MIL-based WSVOG framework that jointly enjoys their merits.…”
Section: Related Work
Confidence: 99%
“…where Temp(·) is a function that jointly considers the temporal region proposals and the queried objects, and outputs the likelihood of each frame-bag being positive. There are multiple implementations of the Temp(·) function in previous works [2, 31, 41], such as first concatenating the pooled features of region proposals and queried objects, and then using MLPs with a softmax to transform them into likelihoods [41].…”
Section: Formulation of MIL-based Framework
Confidence: 99%
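The Temp(·) variant described in the statement above — concatenate pooled region-proposal features with the queried-object feature, then score each frame-bag with an MLP and a softmax over frames — can be sketched as follows. This is a minimal illustrative sketch, not the cited authors' implementation; the class name `TempScorer`, the feature dimensions, and the mean-pooling of proposals are all assumptions.

```python
import torch
import torch.nn as nn

class TempScorer(nn.Module):
    """Hypothetical sketch of a Temp(.) function: concatenates pooled
    region-proposal features with the queried object's feature, scores
    each frame-bag with an MLP, and normalizes scores with a softmax
    over frames. Dimensions are illustrative assumptions."""

    def __init__(self, region_dim=1024, query_dim=300, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(region_dim + query_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, region_feats, query_feat):
        # region_feats: (T, N, region_dim) -- N proposals in each of T frames
        # query_feat:   (query_dim,)       -- pooled queried-object feature
        pooled = region_feats.mean(dim=1)                  # pool proposals per frame -> (T, region_dim)
        q = query_feat.unsqueeze(0).expand(pooled.size(0), -1)
        scores = self.mlp(torch.cat([pooled, q], dim=-1)).squeeze(-1)  # (T,)
        return torch.softmax(scores, dim=0)                # per-frame-bag likelihood

scorer = TempScorer()
probs = scorer(torch.randn(8, 5, 1024), torch.randn(300))  # 8 frames, 5 proposals each
```

In a MIL setting, these per-frame likelihoods would then weight the frame-bags when aggregating the weak video-level supervision signal.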