This paper proposes a gaze-communicative stuffed-toy robot system with joint attention and eye-contact reactions based on ambient gaze-tracking. For free and natural interaction, we adopted our remote gaze-tracking method. Corresponding to the user's gaze, the gaze-reactive stuffed-toy robot is designed to gradually establish 1) joint attention using the direction of the robot's head and 2) eye-contact reactions selected from several sets of motions. From both subjective evaluations and observations of the user's gaze in the demonstration experiments, we found that i) joint attention draws the user's interest along with the user-guessed interest of the robot, ii) "eye contact" brings the user a favorable feeling toward the robot, and iii) this feeling is enhanced when "eye contact" is used in combination with "joint attention." These results support the approach of our embodied gaze-communication model.
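As an illustration of the behavior described in this abstract, the following is a minimal, hypothetical sketch of how tracked gaze targets could be mapped to joint-attention and eye-contact reactions. The robot interface, motion names, and gaze-target labels are assumptions made for illustration; the paper does not publish code.

```python
# Hypothetical sketch: map one gaze-tracking update to a robot reaction.
import random

EYE_CONTACT_MOTIONS = ["nod", "tilt_head", "blink"]  # assumed motion set

class StuffedToyRobot:
    """Stand-in for the robot's actuation interface (illustrative only)."""
    def turn_head_toward(self, target):
        print(f"joint attention: turning head toward {target}")

    def play_motion(self, motion):
        print(f"eye contact: playing motion '{motion}'")

def react_to_gaze(robot, gaze_target):
    """Choose a reaction for one update from the remote gaze tracker."""
    if gaze_target == "robot_face":
        # The user is looking at the robot: respond with an eye-contact motion.
        robot.play_motion(random.choice(EYE_CONTACT_MOTIONS))
    elif gaze_target is not None:
        # The user is looking at an object: establish joint attention by
        # orienting the robot's head toward the same object.
        robot.turn_head_toward(gaze_target)

robot = StuffedToyRobot()
for target in ["picture_book", "robot_face", None]:
    react_to_gaze(robot, target)
```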
This paper proposes talking-head synthesis with expressions from speech. Talking-head synthesis can be considered a sequence-to-sequence mapping problem with audio as input and video as output. To synthesize talking heads, we use the SAVEE database, which consists of videos of multiple spoken sentences recorded from the front of the face. The audiovisual data can be regarded as two parallel sequences of audio and visual features composed of continuous values, so we model the mapping between them with a regression model. In this research, the regression model is trained with a long short-term memory (LSTM) network by minimizing the mean squared error (MSE); the audio features serve as the input and the visual features as the target of the LSTM, from which talking heads are synthesized from speech. Unlike existing approaches that use phonemes as audio features and can therefore synthesize only neutral-expression talking heads, our method uses lower-level audio features than phonemes, which enables the synthesis of talking heads with expressions. On the SAVEE database, we achieved a minimum MSE of 17.03 on our test set. In the experiments, we use mel-frequency cepstral coefficients (MFCCs), ΔMFCC, and Δ²MFCC with energy as audio features, and active appearance model (AAM) parameters over the entire face region as visual features.
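To make the regression setup concrete, the following is a minimal sketch (not the authors' code) of an LSTM that maps MFCC-based audio feature sequences to AAM visual parameters and is trained with an MSE loss. The feature dimensions, hidden size, and use of PyTorch are assumptions for illustration.

```python
# Minimal sketch of the audio-to-visual regression described above.
import torch
import torch.nn as nn

AUDIO_DIM = 39   # e.g. 13 MFCC (with energy) + Δ + Δ²; exact size is assumed
VISUAL_DIM = 30  # number of AAM parameters per frame (assumed)

class AudioToAAM(nn.Module):
    def __init__(self, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(AUDIO_DIM, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, VISUAL_DIM)

    def forward(self, audio_seq):
        # audio_seq: (batch, time, AUDIO_DIM) -> (batch, time, VISUAL_DIM)
        out, _ = self.lstm(audio_seq)
        return self.proj(out)

model = AudioToAAM()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch standing in for time-aligned audio/visual sequences.
audio = torch.randn(8, 100, AUDIO_DIM)
target_aam = torch.randn(8, 100, VISUAL_DIM)

pred = model(audio)          # audio features as input
loss = criterion(pred, target_aam)  # visual features as regression target
loss.backward()
optimizer.step()
```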
In this paper, we present robust methods for automatically segmenting phases in a specified surgical workflow by using latent Dirichlet allocation (LDA) and hidden Markov model (HMM) approaches. More specifically, our goal is to output an appropriate phase label for each given time point of a surgical workflow in an operating room. The fundamental idea behind our work lies in constructing an HMM based on observed values obtained via an LDA topic model covering optical flow motion features of general working contexts, including medical staff, equipment, and materials. We capture these working contexts by using multiple synchronized cameras to record the surgical workflow. Further, we validate the robustness of our methods by conducting experiments involving up to 12 phases of surgical workflows, with an average workflow length of 12.8 minutes. The maximum average accuracy achieved after applying leave-one-out cross-validation was 84.4%, which we found to be a very promising result.
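The following is a rough sketch, under assumed data shapes, of the kind of LDA + HMM pipeline this abstract describes: optical-flow descriptors are quantized into "motion words", per-window word histograms are reduced to topic proportions with LDA, and an HMM over those proportions assigns a state per time window. This is an illustrative reconstruction, not the authors' implementation, which additionally uses multiple synchronized cameras and annotated phase labels.

```python
# Illustrative LDA + HMM pipeline over optical-flow motion features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from hmmlearn import hmm

N_WORDS, N_TOPICS, N_PHASES = 64, 10, 12  # vocabulary/topic/phase sizes (assumed)

# One optical-flow feature vector per frame (dummy data standing in for real video).
flow_descriptors = np.random.rand(6000, 16)

# 1) Quantize flow descriptors into a motion-word vocabulary.
kmeans = KMeans(n_clusters=N_WORDS, n_init=10).fit(flow_descriptors)
words = kmeans.predict(flow_descriptors)

# 2) Bag-of-motion-words histogram for each fixed-length time window.
window = 30  # frames per window (assumed)
histograms = np.array([
    np.bincount(words[i:i + window], minlength=N_WORDS)
    for i in range(0, len(words) - window, window)
])

# 3) LDA topic proportions per window serve as the HMM observations.
topics = LatentDirichletAllocation(n_components=N_TOPICS).fit_transform(histograms)

# 4) HMM over topic proportions; decoded states play the role of phase labels here.
phase_hmm = hmm.GaussianHMM(n_components=N_PHASES, covariance_type="diag", n_iter=50)
phase_hmm.fit(topics)
phase_labels = phase_hmm.predict(topics)
print(phase_labels[:20])
```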