Recent advances in the field of machine learning have shown great potential for the automatic recognition of apparent human emotions. In the era of the Internet of Things (IoT) and big-data processing, where voice-based systems are well established, there are genuine opportunities to leverage cutting-edge technologies to develop personalised and human-centered services, with growing demand in many areas such as education, health, well-being and entertainment. Automatic emotion recognition from speech, a key element of such personalised and human-centered services, has reached a degree of maturity that makes it of broad commercial interest today.

However, major limiting factors still prevent a broad applicability of emotion recognition technology. For example, one open challenge is the poor generalisation capability of currently used feature extraction techniques in interpreting expressions of affect across different persons, contexts, cultures and languages.

Since speech and emotion involve interdependent cognitive processes, emotion can be observed both in the spoken words and in the acoustic properties of the speech signal, where many other factors such as gender, age, culture and personality come into play. Even though features derived from speech science have made it possible to describe and predict some expressions of affect relatively well, these representations do not encompass all the perceptual cues that humans may sense during an emotional experience. With the advancement of machine (deep) learning, computational methods have been proposed for learning representations directly from raw speech data. Newly introduced deep representations, although not as easily interpretable as most descriptors from speech science, promise to help resolve many existing issues in affective computing research, such as the lack of labelled data, robustness to noise, and domain mismatch [1,2,3].

In this contribution, we provide a brief history and critical overview of the different speech representations that have been used in automatic emotion recognition over the years (cf. Figure 1), focusing on how and why the new unsupervised representations in particular can provide unprecedented benefits in affective computing. Thus, we stay here mainly on the topic of speech representations, but also mention the new trend to integrate