2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00623
Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos

Cited by 87 publications (82 citation statements)
References 31 publications
“…It was first studied in the image domain (Zhao et al, 2018; …). Later, given a sequence of transcriptions and their corresponding video clips as well as their temporal alignment, Huang et al (2018) …”
Section: Related Work (mentioning)
confidence: 99%
“…As the ordering supervision can be automatically extracted from language, our work is related to using language as supervision for videos. The supervision usually comes from movie scripts [8,2,41] or transcription of instructional videos [1,33,25,14]. Unlike these approaches, we assume the discrete action labels are already extracted and focus on leveraging the ordering information as supervision.…”
Section: Related Work (mentioning)
confidence: 99%
“…In 2017, Huang et al [31] studied the task of reference resolution, which aimed to temporally link an entity to the original action that produced it. In 2018, they further investigated the visual grounding problem [32], which explored the visual-linguistic meaning of referring expressions in both spatial and temporal domains. In the same year, Zhou et al [82] presented a procedure segmentation task, targeting at segmenting an instructional video into category-independent procedure segments.…”
Section: Tasks For Instructional Video Analysis (mentioning)
confidence: 99%
“…As an extension work, they further provided the object level annotation [81]. Both the YouCook and YouCook2 datasets can be used for the video caption tasks, and the YouCook2 further facilitated the research for procedure segmentation [82] and video object grounding [32], [81].…”
Section: Datasets (mentioning)
confidence: 99%