2020
DOI: 10.1007/978-3-030-58526-6_26

Learning to Learn Words from Visual Scenes

Cited by 16 publications (15 citation statements). References 33 publications.
“…While the majority of works that used EPIC-KITCHENS have focused on action recognition and anticipation, in line with the defined challenges, our dataset lends itself naturally to a variety of less explored tasks. Of these, recent research has explored using EPIC-KITCHENS for: video object reasoning and detection [67], [68], action retrieval [69], visual learning of novel words [57], unsupervised domain adaptation [70] and learning environmental affordances [71]. Using EPIC-KITCHENS for these tasks has only been made possible due to the choices made when collecting this dataset.…”
Section: Discussion
confidence: 99%
“…Those have only been used in Section 4.1 for the object detection challenge. However, a surge of recent approaches have used the object bounding boxes for action recognition on our dataset [57], [58], [59]. Modality and Fusion Results: In Table 8, we present results of the Temporal Segment Network (TSN) on the three modalities separately -RGB, Flow and Audio, as well as their fusion.…”
Section: Action Recognition Benchmark
confidence: 99%
“…However, these works only extract entities from captions, while we also learn from the properties and relations described. Also related are recent methods that use supervision from visual-language pairs [10,30,33,43,47], but these learn general-purpose representations and do not perform scene graph generation.…”
Section: Related Work
confidence: 99%
“…Capturing compositionality in language has been a long-standing challenge (Fodor et al., 1988) for neural networks. Recent works explore the problem through compositional generalization on synthetic instruction following (Lake and Baroni, 2017), text-based games (Yuan et al., 2019), visual question answering (Bahdanau et al., 2019), and visually grounded masked word prediction (Surís et al., 2019). In particular, a closely related study considers continual learning of sequence prediction for synthetic instruction following.…”
Section: Related Work
confidence: 99%