In this paper, we propose an integrated approach to robot vision: a key frame-based skeleton feature estimation and action recognition network (KFSENet) that combines action recognition with face and emotion recognition to enable social robots to engage in more personal interactions. Instead of extracting human skeleton features from the entire video, we propose a key frame-based approach that extracts them using pose estimation models. We select the key frames using the gradient of a proposed total motion metric computed from dense optical flow. The skeleton features extracted from the selected key frames are used to train a deep neural network, the double-feature double-motion network (DDNet), for action recognition. The proposed KFSENet uses a simpler model to learn and differentiate between action classes, is computationally cheaper, and yields better action recognition performance than existing methods. The use of key frames eliminates unnecessary and redundant information, which improves classification accuracy and decreases computational cost. The proposed method is evaluated on both publicly available benchmark datasets and self-collected datasets, and its performance is compared with existing state-of-the-art methods; the results indicate that it performs better. Moreover, the proposed framework integrates face and emotion recognition to enable social robots to engage in more personal interaction with humans.
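The sketch below illustrates the key-frame selection idea described above: score each frame by a total-motion metric derived from dense optical flow, then pick frames where the gradient of that metric is largest. It is a minimal sketch assuming OpenCV; the exact metric, gradient criterion, and the number of key frames (top_k) are illustrative assumptions, not the paper's precise formulation.

```python
import cv2
import numpy as np

def total_motion_per_frame(video_path):
    """Score each frame transition by the mean dense optical-flow magnitude."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    scores = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow between consecutive frames
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        scores.append(float(mag.mean()))  # total-motion score for this transition
        prev_gray = gray
    cap.release()
    return np.asarray(scores)

def select_key_frames(scores, top_k=16):
    """Pick frame indices where the motion metric changes most sharply."""
    grad = np.abs(np.gradient(scores))
    return np.sort(np.argsort(grad)[-top_k:])
```

The selected indices would then be passed to a pose estimation model to extract the skeleton features used for action recognition.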
This paper presents the hardware and control software architectures of an intelligent humanoid robot. The robot has a mobile base consisting of three omnidirectional wheels, which allows it to move freely with three degrees of freedom (DOF), two 6-DOF arms, and a 3-DOF neck and head that allow it to perform most common human movements. Detailed hardware components are given to show our mechanical design of the robot. The control software is structured on the robot operating system (ROS) framework, which serves mainly as a bridge connecting the control modules and the various peripheral devices to ease the management of robot system tasks. We also present the detailed structure of the robot control system, which consists of all key control modules that enable the robot's functions: from the upper level with AI-based techniques such as image and sound processing, to the middle level with the robot motion controllers, and down to the lower level with the management of actuators and sensors. The proposed architecture is being developed and tested on a real humanoid robot prototype called Bonbon to support English teaching in elementary schools.
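As a minimal sketch of how ROS bridges the levels described above, the node below (assuming ROS 1 and rospy) shows a middle-level motion controller that receives high-level commands and publishes velocities to the omnidirectional base. The topic names (/bonbon/cmd, /mobile_base/cmd_vel) and the command vocabulary are illustrative assumptions, not the robot's actual interfaces.

```python
import rospy
from std_msgs.msg import String
from geometry_msgs.msg import Twist

def on_command(msg, vel_pub):
    """Translate a high-level text command into a base velocity message."""
    cmd = Twist()
    if msg.data == "move_forward":
        cmd.linear.x = 0.2       # m/s, drive forward
    elif msg.data == "turn_left":
        cmd.angular.z = 0.5      # rad/s, rotate in place
    vel_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("motion_controller")
    # Lower-level base controller listens on this velocity topic (assumed name)
    vel_pub = rospy.Publisher("/mobile_base/cmd_vel", Twist, queue_size=10)
    # Upper-level AI modules publish commands on this topic (assumed name)
    rospy.Subscriber("/bonbon/cmd", String, on_command, callback_args=vel_pub)
    rospy.spin()
```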
This letter presents a novel error correction module using a Bag-of-Words model and deep neural networks to improve the accuracy of cloud-based speech-to-text services when recognizing non-native speakers with foreign accents. The Bag-of-Words model transforms text into input vectors for the deep neural network, which is trained using typical sentences from the elementary school curriculum in Vietnam and the Google Speech-to-Text output for those sentences. The trained network is then used for real-time error correction on a humanoid robot and yields 18% better accuracy than Google Speech-to-Text alone.
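The sketch below illustrates the Bag-of-Words error-correction idea: vectorize the raw speech-to-text hypothesis and let a small neural network map it to the intended curriculum sentence. It is a minimal sketch assuming scikit-learn; the toy training pairs and network size are illustrative assumptions, not the letter's actual data or architecture.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Toy training pairs: noisy speech-to-text hypotheses and the intended sentences.
asr_outputs = ["what is you name", "how old a you", "nice too meet you"]
target_sentences = ["what is your name", "how old are you", "nice to meet you"]

vectorizer = CountVectorizer()                       # Bag-of-Words features
X = vectorizer.fit_transform(asr_outputs)
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
model.fit(X, target_sentences)                       # each target sentence is a class

def correct(asr_text):
    """Map a raw speech-to-text hypothesis to the closest known sentence."""
    return model.predict(vectorizer.transform([asr_text]))[0]

print(correct("what is you name"))                   # -> "what is your name"
```

In practice the classifier would be trained on the curriculum sentences and their Google Speech-to-Text transcriptions, so that the robot can recover the intended sentence in real time from an accented, partially misrecognized utterance.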
This article presents the design of the mechatronic system for an intelligent humanoid robot employed for teaching the English language. The robot looks like a boy, standing 1.2 m tall and weighing 40 kg. It consists of an upper body with 21 degrees of freedom, comprising a head, two arms, two hands and a ribcage, and a mobile platform with three omnidirectional wheels. The control system is built around a computer that controls the entire operation of the robot, including motion planning, voice recognition and synchronization, face recognition and gestures; it receives commands from the remote control and monitoring station, receives signals from the microphones and cameras, and exchanges signals with the mobile-module controller and the upper-body controller. Microphones, speakers and cameras are located at the head and chest of the robot to perform voice communication and image acquisition. A touch screen mounted on the front of the robot's chest allows it to interact with people and display the necessary information. The robot can communicate with people by voice, perform operations such as greeting, expressing emotions, dancing and singing, supports English language teaching in primary schools, and is extensible to many other practical applications.