An Object is Worth Six Thousand Pictures: The Egocentric, Manual, Multi-image (EMMI) Dataset

Wang, Xiaohan; Eliott, Fernanda M.; Ainooson, James; Palmer, Joshua H.; Kunda, Maithilee

doi:10.1109/iccvw.2017.279

Cited by 6 publications

(7 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In (Abebe & Cavallaro, 2017c), stacked spectrograms of motion patterns extracted from and GTEA Gaze+ 8 . However, since the publication of (Nguyen et al, 2016) other datasets have appeared such as some datasets for manipulated object detection such as the EMMI dataset by Wang et al (2017a). In their paper 735 they also explore other manipulated object recognition datasets, which could be interesting to the reader.…”

Section: From Wearable or First-person Visionmentioning

confidence: 99%

A review on video-based active and assisted living technologies for automated lifelogging

Climent-Pérez

Spinsante

Mihailidis

et al. 2020

Expert Systems with Applications

View full text Add to dashboard Cite

show abstract

Section: From Wearable or First-person Visionmentioning

confidence: 99%

A review on video-based active and assisted living technologies for automated lifelogging

Climent-Pérez

Spinsante

Mihailidis

et al. 2020

Expert Systems with Applications

View full text Add to dashboard Cite

show abstract

“…The first dataset is T-LESS [30], which contains 30 objects with no relevant texture. Followed by ToyBox [31] a dataset depicting 360 objects manipulated by a person. Toybox allowed us to evaluate how well the model scales (we selected ToyBox over the iCubWorld dataset [37] since, at the submission time, the latter contains only 28 instances).…”

Section: Methodsmentioning

confidence: 99%

“…In T-LESS [30], the model has to recognize unseen viewpoints. In Toybox [31] we used the hodgepodge videos for training and translations across the three axes for testing. We use ARC as in [4].…”

Section: B Triplet Lossmentioning

confidence: 99%

Learning Discriminative Embeddings for Object Recognition on-the-fly

Lagunes-Fortiz

Damen

Mayol-Cuevas

2019

2019 International Conference on Robotics and Automation (ICRA)

View full text Add to dashboard Cite

We address the problem of learning to recognize new objects on-the-fly efficiently. When using CNNs, a typical approach for learning new objects is by fine-tuning the model. However, this approach relies on the assumption that the original training set is available and requires high-end computational resources for training the ever-growing dataset efficiently, which can be unfeasible for robots with limited hardware. To overcome these limitations, we propose a new architecture that: 1) Instead of predicting labels, it learns to generate discriminative and separable embeddings of an object's viewpoints by using a Supervised Triplet Loss, which is easier to implement than current smart mining techniques and the trained model can be applied to unseen objects. 2) Infers an object's identity efficiently by utilizing a lightweight classifier in the features embedding space, this keeps the inference time in the order of milliseconds and can be retrained efficiently when new objects are learned. We evaluate our approach on four real-world images datasets used for Robotics and Computer Vision applications: Amazon Robotics Challenge 2017 by MIT-Princeton, T-LESS, ToyBoX, and CORe50 datasets. Code available at [1].Miguel Lagunes-Fortiz thanks the Mexican Council of Science and Technology (CONACyT) for sponsoring his studies with the scholarship number 686450

show abstract

“…As a first example, data sets cover the sensation during human-environment interaction by measuring (mostly adult) humans directly during performing specific tasks, such as the KIT Motion-Language set for descriptions of whole-body poses (Plappert et al, 2016 ), the Multimodal-HHRI set for personality characterization (Celiktutan et al, 2017 ), and the EASE set for precise motion capturing (Meier et al, 2018 ). Secondly, data sets mimic the human perspective by holding objects in front of a perception device, such as a camera, to capture the diverse and complex but general characteristics of an environment setting, e.g., Core50 (Lomonaco and Maltoni, 2017 ), EMMI (Wang et al, 2017 ), and HOD-40 (Sun et al, 2018 ). And thirdly, humanoid robots are employed for establishing a data set, where multiple modalities are recorded in covering human-like action, i.e., including sensorimotor information, such as the MOD165 set (Nakamura and Nagai, 2017 ) and the Multimodal-HRI set (Azagra et al, 2017 ), or where multiple modalities are gathered from both robot and human in turn-table actions, like in the HARMONIC data set (Newman et al, 2018 ).…”

Section: Related Data Setsmentioning

confidence: 99%

Crossmodal Language Grounding in an Embodied Neurocognitive Model

Heinrich

Yao

Hinz³

et al. 2020

Front. Neurorobot.

View full text Add to dashboard Cite

Human infants are able to acquire natural language seemingly easily at an early age. Their language learning seems to occur simultaneously with learning other cognitive functions as well as with playful interactions with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration. However, characterizing the underlying mechanisms in the brain is difficult and explaining the grounding of language in crossmodal perception and action remains challenging. In this paper, we present a neurocognitive model for language grounding which reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data. In our scenario, we utilize the humanoid robot NICO in obtaining the EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver. The model analysis shows that crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment. The representations self-organize hierarchically and embed temporal and spatial information through composition and decomposition. This model can also provide the basis for further crossmodal integration of perceptually grounded cognitive representations.

show abstract

An Object is Worth Six Thousand Pictures: The Egocentric, Manual, Multi-image (EMMI) Dataset

Cited by 6 publications

References 19 publications

A review on video-based active and assisted living technologies for automated lifelogging

A review on video-based active and assisted living technologies for automated lifelogging

Learning Discriminative Embeddings for Object Recognition on-the-fly

Crossmodal Language Grounding in an Embodied Neurocognitive Model

Contact Info

Product

Resources

About