Speech emotion recognition (SER) systems use information derived from the sound waves of human speech to identify the underlying emotions in utterances. Since 1996, researchers have worked to improve the accuracy of SER systems, their functionality, and the range of emotions they can identify. Although SER systems are now widely used in many domains of modern life and are closely connected to other systems and types of data, their security has not been adequately explored. In this paper, we conduct a comprehensive analysis of potential cyber-attacks aimed at SER systems and of the security mechanisms that may prevent such attacks. To do so, we first describe the core principles of SER systems and discuss prior work in this area, which was mainly aimed at expanding and improving the existing capabilities of SER systems. Then, we present the SER system ecosystem, describing the dataflow and interactions among the components and entities within SER systems, and explore their vulnerabilities, which might be exploited by attackers. Based on the vulnerabilities we identified within the ecosystem, we review existing cyber-attacks from different domains and discuss their relevance to SER systems. We also introduce potential cyber-attacks targeting SER systems that have not been proposed before. Our analysis shows that only 30% of the attacks can be addressed by existing security mechanisms, leaving SER systems unprotected against the remaining 70% of potential attacks. Therefore, we also describe several concrete directions that could be explored to improve the security of SER systems.