Egocentric Activity Recognition on a Budget

Possas, Rafael; Caceres, Sheila Pinto; Ramos, Fábio

doi:10.1109/cvpr.2018.00625

Cited by 38 publications

(25 citation statements)

References 31 publications

“…The wide use of CNNs in third-person vision was followed by their extensive application in egocentric action and activity recognition [16], [21], [38], [40], [41], [56]. Earlier approaches handled CNN features as an additional modality to handcrafted features [49] or as a feature combination mechanism on previously extracted egocentric features [16].…”

Section: Video Activity Recognition (mentioning)

confidence: 99%

“…In [61], [62] optical flow was employed to detect salient regions, which were cropped from the original RGB frames and were given to the network as a second, more focused RGB stream. Other input modalities have been employed including depth [7], [41], egocentric cues comprising hand [63], [64], [65] and object regions [64], [66], [67], head motions [63] and gaze-based saliency maps [63], [65], sensor-based modalities [15], [56], [59] and sound [43], [68], [69]. In [38], [40] object and hand localization and segmentation were intermediate learning steps that forced the network to focus on important egocentric cues prior to action prediction.…”

Section: Video Activity Recognition (mentioning)

confidence: 99%

Multi-Dataset, Multitask Learning of Egocentric Vision Tasks

Kapidis; Poppe; Veltkamp. 2023. IEEE Trans. Pattern Anal. Mach. Intell.

For egocentric vision tasks such as action recognition, there is a relative scarcity of labeled data. This increases the risk of overfitting during training. In this paper, we address this issue by introducing a multitask learning scheme that employs related tasks as well as related datasets in the training process. Related tasks are indicative of the performed action, such as the presence of objects and the position of the hands. By including related tasks as additional outputs to be optimized, action recognition performance typically increases because the network focuses on relevant aspects in the video. Still, the training data is limited to a single dataset because the set of action labels usually differs across datasets. To mitigate this issue, we extend the multitask paradigm to include datasets with different label sets. During training, we effectively mix batches with samples from multiple datasets. Our experiments on egocentric action recognition in the EPIC-Kitchens, EGTEA Gaze+, ADL and Charades-EGO datasets demonstrate the improvements of our approach over single-dataset baselines. On EGTEA we surpass the current state-of-the-art by 2.47%. We further illustrate the cross-dataset task correlations that emerge automatically with our novel training scheme.
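The batch-mixing idea in the abstract above can be illustrated with a minimal sketch. It assumes, as one plausible reading and not as the authors' implementation, that each sample contributes a loss only for the tasks its source dataset actually labels, and that per-task losses are averaged over the samples that carry that task. The names `cross_entropy`, `mixed_batch_loss`, and `task_weights`, and the sample layout, are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def mixed_batch_loss(batch, task_weights=None):
    """Combine per-task losses over a batch mixed from several datasets.

    Each sample is a dict {"logits": {task: [...]}, "labels": {task: int}}.
    Assumption: a task is skipped for samples whose dataset has no label
    for it, so datasets with different label sets can share one batch.
    """
    task_weights = task_weights or {}
    totals, counts = {}, {}
    for sample in batch:
        for task, label in sample["labels"].items():
            loss = cross_entropy(sample["logits"][task], label)
            totals[task] = totals.get(task, 0.0) + loss
            counts[task] = counts.get(task, 0) + 1
    # Average each task over the samples that carry it, then sum the tasks.
    return sum(task_weights.get(t, 1.0) * totals[t] / counts[t] for t in totals)

# Two samples from two hypothetical datasets with different label sets.
batch = [
    {"logits": {"action_a": [0.0, 0.0], "hands": [0.0, 0.0]},
     "labels": {"action_a": 0, "hands": 1}},
    {"logits": {"action_b": [0.0, 0.0, 0.0, 0.0]},
     "labels": {"action_b": 2}},
]
loss = mixed_batch_loss(batch)
```

Masking by label availability is only one way to handle differing label sets; the abstract does not say how the losses are weighted or normalized.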

“…Efficient video activity recognition designed for mobile devices has been studied by several research groups. An energy aware training algorithm was proposed in Possas et al (2018), to demonstrate energy efficient video activity recognition on complex problems. In this work, the authors use reinforcement learning to train a network on both video and motion information captured by sensors while penalizing actions that have high energy costs.…”

Section: Related Work (mentioning)

confidence: 99%
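The energy-penalized training signal described in the statement above can be sketched as a reward that subtracts a cost term from the recognition reward. This is an illustrative assumption, not the formulation used by Possas et al.: the action names, the `ENERGY_COST` values, and the `energy_weight` parameter are all hypothetical.

```python
# Hypothetical per-step energy costs in arbitrary units (not from the paper).
ENERGY_COST = {"motion": 1.0, "vision": 10.0}

def penalized_reward(correct, action, energy_weight=0.05):
    """Reward 1 for a correct prediction, minus a penalty that grows
    with the energy cost of the sensor used for this step."""
    return (1.0 if correct else 0.0) - energy_weight * ENERGY_COST[action]

def best_action(accuracy, energy_weight=0.05):
    """Pick the action with the highest expected penalized reward,
    given an estimated accuracy for each action."""
    def expected(action):
        return accuracy[action] - energy_weight * ENERGY_COST[action]
    return max(accuracy, key=expected)

# With a heavy penalty the cheaper motion sensor wins even though the
# vision model is assumed to be more accurate; a light penalty flips it.
accuracy = {"motion": 0.6, "vision": 0.9}
frugal_choice = best_action(accuracy, energy_weight=0.05)
accurate_choice = best_action(accuracy, energy_weight=0.01)
```

Varying `energy_weight` trades recognition accuracy against energy use, which is the kind of budget trade-off the statement refers to.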

Deep Liquid State Machines With Neural Plasticity for Video Activity Recognition

Soures; Kudithipudi. 2019. Front. Neurosci.

Real-world applications such as first-person video activity recognition require intelligent edge devices. However, size, weight, and power constraints of the embedded platforms cannot support resource intensive state-of-the-art algorithms. Machine learning lite algorithms, such as reservoir computing, with shallow 3-layer networks are computationally frugal as only the output layer is trained. By reducing network depth and plasticity, reservoir computing minimizes computational power and complexity, making the algorithms optimal for edge devices. However, as a trade-off for their frugal nature, reservoir computing sacrifices computational power compared to state-of-the-art methods. A good compromise between reservoir computing and fully supervised networks are the proposed deep-LSM networks. The deep-LSM is a deep spiking neural network which captures dynamic information over multiple time-scales with a combination of randomly connected layers and unsupervised layers. The deep-LSM processes the captured dynamic information through an attention modulated readout layer to perform classification. We demonstrate that the deep-LSM achieves an average of 84.78% accuracy on the DogCentric video activity recognition task, beating state-of-the-art. The deep-LSM also shows up to 91.13% memory savings and up to 91.55% reduction in synaptic operations when compared to similar recurrent neural network models. Based on these results we claim that the deep-LSM is capable of overcoming limitations of traditional reservoir computing, while maintaining the low computational cost associated with reservoir computing.

“…The goal of egocentric vision is to analyze the visual information provided by wearable cameras, which have the capability to acquire images from a first person point-of-view. The analysis of these images provides information about the behavior of the user, useful for several complementary topics like social interactions (Aghaei et al, 2018), scene understanding (Singh et al, 2016), time-space-based localization (Yao et al, 2018), action (Fathi et al, 2011; Possas et al, 2018) or activity recognition (Iwashita et al, 2014; Cartas et al, 2017), or nutritional habits analysis (Bolaños et al, 2018b), among others. Thus, enabling us to understand the whole story and behavior of the users behind the pictures (i.e.…”

Section: Captioning Visual Content (mentioning)

confidence: 99%

Interactivity, Adaptation and Multimodality in Neural Sequence-to-sequence Learning

Abril

Chapter 1 frames the scope of this thesis, introducing the pattern recognition field and, more specifically, the MT field. It reviews the different historical approaches devised to tackle this problem. Moreover, it sets the experimental framework followed in this thesis and the main scientific objectives. Chapter 2 describes the mathematical model that represents the core of the thesis: neural networks. It addresses the parameter estimation process, describes different neural architectures and a number of techniques used throughout the thesis to improve the generalization capability of the model. Chapter 3 introduces the neural machine translation technology, describing the most common architectures and decoding process. Moreover, it reviews different aspects relating to the NMT field that nowadays receive the attention of the research community. It also compares NMT in the different translation tasks that will be tackled in the thesis. Chapter 4 introduces the interactive-predictive pattern recognition field, which aims to minimize the effort spent by the user while supervising an automatic system. It proposes the application of this theoretical framework to the neural technology, introducing alternative interaction protocols. After that, these interactive-predictive neural systems are evaluated. Chapter 5 describes the adaptation of NMT systems via online learning techniques. After receiving a corrected sample, the system can be updated to include this new knowledge. The chapter describes the methods used to perform this adaptation and introduces two novel alternatives. In addition, an active learning framework for neural systems is proposed, useful for a situation that requires the translation of large amounts of data. All these scenarios are thoroughly evaluated in a variety of conditions, including a user evaluation involving professional post-editors. Chapter 6 departs from the MT problem to tackle different multimodal sequence-to-sequence tasks. More precisely, it is focused on the generation of textual descriptions of videos. These techniques are also applied to the captioning of daily events, captured with an egocentric camera. Finally, the interactive-predictive framework described in Chapter 4 is applied to these multimodal systems. Chapter 7 draws the main conclusions of the thesis, describing the scientific contributions and publications derived from it, and traces several lines of future research. These chapters are complemented by two appendices. Appendix A describes NMT-Keras, an open-source library developed to build neural models, that has been used to carry out most of the experiments described in the thesis. In Appendix B we provide the results of a survey carried out in the scope of Chapter 5.
