Learning to Learn Words from Visual Scenes

Surís, Dídac; Epstein, Dave; Ji, Heng; Chang, Shih-Fu; Vondrick, Carl

doi:10.1007/978-3-030-58526-6_26

Cited by 16 publications

(15 citation statements)

References 33 publications

“…While the majority of works that used EPIC-KITCHENS have focused on action recognition and anticipation, in line with the defined challenges, our dataset lends itself naturally to a variety of less explored tasks. Of these, recent research has explored using EPIC-KITCHENS for: video object reasoning and detection [67], [68], action retrieval [69], visual learning of novel words [57], unsupervised domain adaptation [70] and learning environmental affordances [71]. Using EPIC-KITCHENS for these tasks has only been made possible due to the choices made when collecting this dataset.…”

Section: Discussion (mentioning)

confidence: 99%

“…Those have only been used in Section 4.1 for the object detection challenge. However, a surge of recent approaches have used the object bounding boxes for action recognition on our dataset [57], [58], [59]. Modality and Fusion Results: In Table 8, we present results of the Temporal Segment Network (TSN) on the three modalities separately (RGB, Flow and Audio), as well as their fusion.…”

Section: Action Recognition Benchmark (mentioning)

confidence: 99%

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

Damen, Doughty, Farinella et al. 2020

Preprint

Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict non-scripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions (e.g. 'closing a tap' from 'opening' it up).

“…However, these works only extract entities from captions, while we also learn from the properties and relations described. Also related are recent methods that use supervision from visual-language pairs [10,30,33,43,47], but these learn general-purpose representations and do not perform scene graph generation.…”

Section: Related Work (mentioning)

confidence: 99%

Linguistic Structures as Weak Supervision for Visual Scene Graph Generation

Ye, Kovashka 2021

Preprint

Prior work in scene graph generation requires categorical supervision at the level of triplets (subjects and objects, and predicates that relate them), either with or without bounding box information. However, scene graph generation is a holistic task: thus holistic, contextual supervision should intuitively improve performance. In this work, we explore how linguistic structures in captions can benefit scene graph generation. Our method captures the information provided in captions about relations between individual triplets, and context for subjects and objects (e.g. visual properties are mentioned). Captions are a weaker type of supervision than triplets since the alignment between the exhaustive list of human-annotated subjects and objects in triplets, and the nouns in captions, is weak. However, given the large and diverse sources of multimodal data on the web (e.g. blog posts with images and captions), linguistic supervision is more scalable than crowdsourced triplets. We show extensive experimental comparisons against prior methods which leverage instance- and image-level supervision, and ablate our method to show the impact of leveraging phrasal and sequential context, and techniques to improve localization of subjects and objects.

“…Capturing compositionality in language has been a long challenge (Fodor et al, 1988) for neural networks. Recent works explore the problem with compositional generalization on synthetic instruction following (Lake and Baroni, 2017), text-based games (Yuan et al, 2019), visual question answering (Bahdanau et al, 2019), and visually grounded masked word prediction (Surís et al, 2019). In particular, study a closely related task of continual learning of sequence prediction for synthetic instruction following.…”

Section: Related Work (mentioning)

confidence: 99%

Visually Grounded Continual Learning of Compositional Phrases

Jin, Sadhu et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Humans acquire language continually with much more limited access to data samples at a time, as compared to contemporary NLP systems. To study this human-like language acquisition ability, we present VisCOLL, a visually grounded language learning task, which simulates the continual acquisition of compositional phrases from streaming visual scenes. In the task, models are trained on a paired image-caption stream which has shifting object distribution; while being constantly evaluated by a visually-grounded masked language prediction task on held-out test sets. VisCOLL compounds the challenges of continual learning (i.e., learning from continuously shifting data distribution) and compositional generalization (i.e., generalizing to novel compositions). To facilitate research on VisCOLL, we construct two datasets, COCO-shift and Flickr-shift, and benchmark them using different continual learning methods. Results reveal that SoTA continual learning approaches provide little to no improvements on VisCOLL, since storing examples of all possible compositions is infeasible. We conduct further ablations and analysis to guide future work.
