2018
DOI: 10.1007/978-3-030-01261-8_7

Object Level Visual Reasoning in Videos

Abstract: Human activity recognition is typically addressed by detecting key concepts like global and local motion, features related to object classes present in the scene, as well as features related to the global context. The next open challenges in activity recognition require a level of understanding that pushes beyond this and call for models with capabilities for fine distinction and detailed comprehension of interactions between actors and objects in a scene. We propose a model capable of learning to reason about…
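The abstract is truncated at the source, but its stated goal, detailed comprehension of interactions between actors and objects, is commonly realised with relation-network-style heads that score pairs of per-object features. The PyTorch sketch below is a minimal illustration of that general idea only; the class name, the dimensions, and the 125-way output (borrowed from the verb-class count quoted further down) are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ObjectRelationHead(nn.Module):
    """Minimal sketch: reason over pairwise interactions between
    per-object features from a frame. Hypothetical dimensions;
    not the authors' exact architecture."""

    def __init__(self, obj_dim=256, hidden_dim=512, num_classes=125):
        super().__init__()
        # Scores each ordered pair of object features.
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * obj_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Maps the aggregated relation vector to activity logits.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, objects):
        # objects: (batch, num_objects, obj_dim) pooled object features,
        # e.g. from an off-the-shelf detector.
        b, n, d = objects.shape
        # Build all ordered pairs (o_i, o_j).
        oi = objects.unsqueeze(2).expand(b, n, n, d)
        oj = objects.unsqueeze(1).expand(b, n, n, d)
        pairs = torch.cat([oi, oj], dim=-1)   # (b, n, n, 2d)
        relations = self.pair_mlp(pairs)      # (b, n, n, hidden)
        pooled = relations.sum(dim=(1, 2))    # aggregate over all pairs
        return self.classifier(pooled)        # (b, num_classes)
```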

Cited by 148 publications (174 citation statements) · References 32 publications
“…In [57], the relationship between tasks is modelled in a latent space to transfer knowledge between them and reduce the number of required training samples. MTL in egocentric vision appears in [1,28,25,18,29,47].…”
Section: Multitask Learning (mentioning, confidence: 99%)
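The statement above summarises [57] in one sentence: task relationships live in a shared latent space and mediate knowledge transfer. Below is a minimal sketch of one way to realise that, with learnable task embeddings whose similarities gate a shared representation; the names, dimensions, and gating scheme are illustrative assumptions, not the formulation of [57].

```python
import torch
import torch.nn as nn

class LatentTaskMTL(nn.Module):
    """Minimal multi-task sketch: task relationships are modelled in a
    shared latent space (illustrative, not the exact method of [57])."""

    def __init__(self, in_dim=512, latent_dim=128, task_dims=(125, 352)):
        super().__init__()
        self.encoder = nn.Linear(in_dim, latent_dim)  # shared latent space
        # One learnable embedding per task; their inner products act as a
        # soft task-relationship matrix that mediates knowledge transfer.
        self.task_embed = nn.Parameter(torch.randn(len(task_dims), latent_dim))
        self.heads = nn.ModuleList([nn.Linear(latent_dim, d) for d in task_dims])

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # (batch, latent_dim)
        # Task-relationship weights from embedding similarity.
        rel = torch.softmax(self.task_embed @ self.task_embed.t(), dim=-1)
        outputs = []
        for t, head in enumerate(self.heads):
            # Each task gates the shared code with a mixture of the
            # embeddings of related tasks, transferring knowledge.
            mix = rel[t] @ self.task_embed        # (latent_dim,)
            outputs.append(head(z * torch.sigmoid(mix)))
        return outputs
```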
“…For the Epic Kitchen Dataset [4], there are a total of 125 verb classes, and each verb can act on different objects. We report results on the validation set using the same split as [1]. We evaluate only on verb-class prediction, following [1], since the main purpose of this paper is temporal action recognition rather than objects.…”
Section: Results on Ego-motion Action Recognition (mentioning, confidence: 99%)
“…We report results on the validation set using the same split as [1]. We evaluate only on verb-class prediction, following [1], since the main purpose of this paper is temporal action recognition rather than objects. The EGTEA Gaze++ dataset contains 106 classes with 19 different verbs.…”
Section: Results on Ego-motion Action Recognition (mentioning, confidence: 99%)
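Both statements above describe the same protocol: report accuracy over verb classes only, on the validation split of [1], ignoring the object of each action. A minimal sketch of that metric; the function name and tensor layout are hypothetical.

```python
import torch

def verb_top1_accuracy(logits: torch.Tensor, verb_labels: torch.Tensor) -> float:
    """Top-1 accuracy over verb classes only (e.g. 125 classes),
    ignoring the object/noun of each action.

    logits: (num_clips, num_verb_classes) model scores per clip.
    verb_labels: (num_clips,) ground-truth verb indices.
    """
    preds = logits.argmax(dim=-1)
    return (preds == verb_labels).float().mean().item()

# Hypothetical usage on a validation split:
# acc = verb_top1_accuracy(model(val_clips), val_verb_labels)
```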