In the context of Human-Robot Interaction (HRI), emotional understanding is gaining popularity because it makes robots more humanized and user-friendly. Giving a robot the ability to recognize emotions presents several difficulties due to the limitations of the robot's hardware and the real-world environments in which it operates. In this sense, combining an out-of-robot approach with a multimodal approach can be a solution. This paper presents the implementation of a previously proposed multimodal emotion recognition system in the context of social robotics. The system runs on a server and bases its prediction on four input modalities (face, posture, body, and context features) captured through the robot's sensors; the predicted emotion triggers changes in the robot's behavior. Running on a server overcomes the robot's hardware limitations at the cost of some communication delay, while working with several modalities makes the system more robust and adaptive in complex real-world scenarios. This research focuses on analyzing, explaining, and arguing for the usability and viability of an out-of-robot, multimodal approach for emotional robots. Functionality tests yielded the expected results, showing that the entire proposed pipeline takes around two seconds; this delay is attributable to the deep learning models used, which can still be improved. Regarding the HRI evaluations, a brief discussion of the remaining assessments is presented, explaining how difficult a well-done evaluation of this work can be. A demonstration of the system's functionality can be seen at https://youtu.be/MYYfazSa2N0.
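
To make the out-of-robot architecture concrete, the sketch below illustrates, under stated assumptions, how the robot side could package the four modality features, query the remote server, and map the returned emotion to a behavior change. The endpoint URL, payload schema, emotion labels, and behavior mapping are illustrative placeholders, not details taken from the paper.

```python
import requests  # generic HTTP client; any transport to the server would do

# Hypothetical server endpoint; the actual API is not specified in the paper.
SERVER_URL = "http://emotion-server.local:8000/predict"

# Illustrative mapping from a predicted emotion to a robot behavior change.
BEHAVIOR_BY_EMOTION = {
    "happy": "nod_and_smile",
    "sad": "approach_slowly",
    "angry": "increase_distance",
    "neutral": "idle",
}


def request_emotion(face, posture, body, context, timeout=2.0):
    """Send the four modality feature vectors to the server and return
    the predicted emotion label, or None if the request fails."""
    payload = {
        "face": face,        # e.g. facial features or embedding
        "posture": posture,  # e.g. joint angles
        "body": body,        # e.g. body keypoints
        "context": context,  # e.g. scene descriptors
    }
    try:
        response = requests.post(SERVER_URL, json=payload, timeout=timeout)
        response.raise_for_status()
        return response.json().get("emotion")
    except requests.RequestException:
        return None  # communication delay or failure: keep current behavior


def select_behavior(emotion):
    """Map the predicted emotion to a behavior change on the robot."""
    return BEHAVIOR_BY_EMOTION.get(emotion, "idle")


if __name__ == "__main__":
    # Dummy feature vectors stand in for data captured by the robot's sensors.
    emotion = request_emotion(face=[0.1] * 128, posture=[0.0] * 17,
                              body=[0.2] * 34, context=[0.5] * 10)
    print("Predicted emotion:", emotion, "-> behavior:", select_behavior(emotion))
```

In this sketch the timeout bounds the communication delay mentioned above, and a failed or slow request simply leaves the robot's current behavior unchanged rather than blocking the interaction.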