Dimensional emotion recognition from physiological signals is a highly challenging task. Common methods rely on hand-crafted features that do not yet provide the performance necessary for real-life applications. In this work, we exploit a series of convolutional and recurrent neural networks to predict affect from physiological signals, such as the electrocardiogram and electrodermal activity, directly from the raw time representation. The motivation behind this so-called end-to-end approach is that, ultimately, the network learns an intermediate representation of the physiological signals that better suits the task at hand. Experimental evaluations show that this very first study on end-to-end learning of emotion from physiology yields significantly better performance than existing work on the challenging RECOLA database, which includes fully spontaneous affective behaviors displayed during naturalistic interactions. Furthermore, we gain a better understanding of the models' inner representations by demonstrating that the activations of some cells in the convolutional network correlate strongly with hand-crafted features.
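The convolutional-then-recurrent pipeline described above can be illustrated with a minimal, untrained forward pass over a raw 1-D signal. All sizes here (filter width, channel counts, pooling factor, hidden dimension) and the two-dimensional arousal/valence output are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    # valid 1-D convolution: x (T,), w (F, K), b (F,) -> (T-K+1, F)
    K = w.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, K)  # (T-K+1, K)
    return windows @ w.T + b

def maxpool(h, p):
    # non-overlapping temporal max-pooling by factor p; trims the tail
    T = (h.shape[0] // p) * p
    return h[:T].reshape(-1, p, h.shape[1]).max(axis=1)

def rnn(h, Wx, Wh, bh):
    # simple tanh recurrence over the pooled feature sequence
    s = np.zeros(Wh.shape[0])
    out = []
    for t in range(h.shape[0]):
        s = np.tanh(h[t] @ Wx + s @ Wh + bh)
        out.append(s)
    return np.stack(out)

# hypothetical sizes: T samples, filter width K, F filters, H hidden units
T, K, F, H = 400, 20, 8, 16
x = rng.standard_normal(T)                 # raw 1-D signal (e.g. ECG samples)
w = rng.standard_normal((F, K)) * 0.1      # learnable conv filters
b = np.zeros(F)
Wx = rng.standard_normal((F, H)) * 0.1     # input-to-hidden weights
Wh = rng.standard_normal((H, H)) * 0.1     # hidden-to-hidden weights
bh = np.zeros(H)
Wo = rng.standard_normal((H, 2)) * 0.1     # 2 outputs: arousal, valence

feat = np.maximum(conv1d(x, w, b), 0)      # conv + ReLU: learned features
feat = maxpool(feat, 4)                    # temporal pooling
states = rnn(feat, Wx, Wh, bh)             # recurrent temporal integration
pred = states @ Wo                         # per-frame (arousal, valence)
print(pred.shape)                          # -> (95, 2)
```

The key property of this setup is that the convolutional filters `w` are trained jointly with the recurrent layer, so the intermediate representation `feat` is learned from the raw waveform rather than hand-crafted.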