To further extend the applicability of wearable sensors in various domains such as mobile health systems and the automotive industry, new methods for accurately extracting subtle physiological information from these wearable sensors are required. However, the extraction of valuable information from physiological signals is still challenging—smartphones can count steps and compute heart rate, but they cannot recognize emotions and related affective states. This study analyzes the possibility of using end-to-end multimodal deep learning (DL) methods for affect recognition. Ten end-to-end DL architectures are compared on four different datasets with diverse raw physiological signals used for affect recognition, including emotional and stress states. The DL architectures specialized for time-series classification were enhanced to simultaneously facilitate learning from multiple sensors, each having their own sampling frequency. To enable fair comparison among the different DL architectures, Bayesian optimization was used for hyperparameter tuning. The experimental results showed that the performance of the models depends on the intensity of the physiological response induced by the affective stimuli, i.e., the DL models recognize stress induced by the Trier Social Stress Test more successfully than they recognize emotional changes induced by watching affective content, e.g., funny videos. Additionally, the results showed that the CNN-based architectures might be more suitable than LSTM-based architectures for affect recognition from physiological sensors.