“…A number of distinct input modalities have been employed to assist egocentric action recognition. These include depth [7,151]; egocentric cues comprising hand [66,129,134] and object regions [66,68,164], head motion [134], and gaze-based saliency maps [129,134]; sensor-based modalities [101,114,139]; and sound [18,74,167]. Typically, these methods require specialized sensors, such as depth cameras, eye trackers, accelerometers, or inertial measurement units, to obtain the additional inputs, whereas sound is captured by the camera's built-in microphone.…”