The paper introduces a multimodal affective dataset named VREED (VR Eyes: Emotions Dataset) in which emotions were triggered using immersive 360° Video-Based Virtual Environments (360-VEs) delivered via Virtual Reality (VR) headset. Behavioural (eye tracking) and physiological signals (Electrocardiogram (ECG) and Galvanic Skin Response (GSR)) were captured, together with self-reported responses, from healthy participants (n=34) experiencing 360-VEs (n=12, 1--3 min each) selected through focus groups and a pilot trial. Statistical analysis confirmed the validity of the selected 360-VEs in eliciting the desired emotions. Preliminary machine learning analysis was carried out, demonstrating state-of-the-art performance reported in affective computing literature using non-immersive modalities. VREED is among the first multimodal VR datasets in emotion recognition using behavioural and physiological signals. VREED is made publicly available on Kaggle1. We hope that this contribution encourages other researchers to utilise VREED further to understand emotional responses in VR and ultimately enhance VR experiences design in applications where emotional elicitation plays a key role, i.e. healthcare, gaming, education, etc.
Extracting specific attributes of a face within an image, such as emotion, age, or head pose has numerous applications. As one of the most widely used vision-based attribute extraction models, HPE (Head Pose Estimation) models have been extensively explored. In spite of the success of these models, the pre-processing step of cropping the region of interest from the image, before it is fed into the network, is still a challenge. Moreover, a significant portion of the existing models are problem-specific models developed specifically for HPE. In response to the wide application of HPE models and the limitations of existing techniques, we developed a multi-purpose, multi-task model to parallelize face detection and pose estimation (i.e., along both axes of yaw and pitch). This model is based on the Mask-RCNN object detection model, which computes a collection of mid-level shared features in conjunction with some independent neural networks, for the detection of faces and the estimation of poses. We evaluated the proposed model using two publicly available datasets, Prima and BIWI, and obtained MAEs (Mean Absolute Errors) of 8.0 ± 8.6, and 8.2 ± 8.1 for yaw and pitch detection on Prima, and 6.2 ± 4.7, and 6.6 ± 4.9 on BIWI dataset. The generalization capability of the model and its cross-domain effectiveness was assessed on the publicly available dataset of UTKFace for face detection and age estimation, resulting a MAE of 5.3 ± 3.2. A comparison of the proposed model's performance on the domains it was tested on reveals that it compares favorably with the state-of-the-art models, as demonstrated by their published results. We provide the source code of our model for public use at: https://github.com/kahroba2000/MTL_MRCNN.
The version in the Kent Academic Repository may differ from the final published version. Users are advised to check http://kar.kent.ac.uk for the status of the paper. Users should always cite the published version of record.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.