2017
DOI: 10.1007/s11263-017-1018-6
Sentence Directed Video Object Codiscovery

Abstract: Video object codiscovery can leverage the weak semantic constraint implied by sentences that describe the video content. Our codiscovery method, like other object codetection techniques, does not employ any pretrained object models or detectors. Unlike most prior work that focuses on codetecting large objects which are usually salient both in size and appearance, our method can discover small or medium sized objects as well as ones that may be occluded for part of the video. More importantly, our method can co…

Cited by 17 publications (11 citation statements)
References 55 publications
“…These are limited to a small set of nouns. Object co-localization focuses on discovering and detecting an object in images or videos without any bounding box annotation, but only from image/video level labels [3,6,23,30,38,40,51]. These works are similar to ours with respect to the amount of supervision, but they focus on a few discrete classes, while our approach can handle arbitrary phrases and allows for localization of novel phrases.…”
Section: Related Work
confidence: 99%
“…In recent years, researchers have also explored grounding in videos. Yu and Siskind (2015) grounded objects in constrained videos by leveraging weak semantic constraints implied by a sequence of sentences. Vasudevan et al (2018) grounded objects in the last frame of stereo videos with the help of text, motion cues, human gazes and spatial-temporal context.…”
Section: Related Work
confidence: 99%
“…As such the work revisits well-studied terrain (Coyne and Sproat, 2001). Another recent study in this area is Yu and Siskind (2017), wherein spatial relation models are used to locate and identify similar objects in several video streams. We should separately mention the spatial modelling studies by Malinowski and Fritz (2014) and, especially, Collell et al (2017), which apply deep neural networks to learning spatial templates for triplets of form (relatum, relation, referent).…”
Section: Related Work
confidence: 99%