Predicting the emotional responses of humans to acoustic features of the surrounding environment has great application potential in fields ranging from video games and the therapeutic use of virtual reality to the emotional design of spaces according to their expected use. In this paper we model the estimation of the classical emotion-characterization parameters (arousal and valence) from sounds. Using convolutional neural networks and convolutional autoencoders, the model is trained to predict these parameters on a standard dataset [1], improving on the results reported in the previous literature. The relevance of the work, beyond the improvement obtained through the use of autoencoders, is that it eliminates the need to compute handcrafted features, demonstrating the ability of convolutional neural networks to process raw audio. Another main contribution of the paper is a new way of visualizing the errors in the joint estimation of arousal and valence that facilitates the evaluation of the models' results. Finally, using the bootstrap to estimate confidence intervals for the MSE and r² of the deep learning models shows that, compared with non-overlapping samples, overlapping samples introduce a performance bias.
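As an illustration of the evaluation procedure mentioned above, the following Python sketch computes percentile-bootstrap confidence intervals for the MSE and r² of a regression model from paired ground-truth and predicted values. It is a minimal sketch, not the paper's code: the function name `bootstrap_ci`, its parameters, and clip-level resampling with replacement are assumptions for illustration.

```python
# Minimal sketch (not the authors' code): percentile-bootstrap confidence
# intervals for MSE and r^2 over paired predictions, resampling clips
# with replacement. All names and parameters here are hypothetical.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Return (MSE CI, r^2 CI) as [lower, upper] percentile intervals."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    mses, r2s = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample indices with replacement
        mses.append(mean_squared_error(y_true[idx], y_pred[idx]))
        r2s.append(r2_score(y_true[idx], y_pred[idx]))
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    return np.percentile(mses, [lo, hi]), np.percentile(r2s, [lo, hi])

if __name__ == "__main__":
    # Hypothetical data standing in for per-clip arousal (or valence) values.
    rng = np.random.default_rng(1)
    y_true = rng.uniform(1, 9, 500)             # synthetic ground truth
    y_pred = y_true + rng.normal(0, 0.5, 500)   # synthetic model output
    (mse_lo, mse_hi), (r2_lo, r2_hi) = bootstrap_ci(y_true, y_pred)
    print(f"MSE 95% CI: [{mse_lo:.3f}, {mse_hi:.3f}]")
    print(f"r2  95% CI: [{r2_lo:.3f}, {r2_hi:.3f}]")
```

Resampling at the level of whole clips, as sketched here, keeps bootstrap replicates independent; resampling overlapping audio segments instead would let near-duplicate samples appear on both sides of an evaluation split, which is the kind of performance bias the abstract refers to.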