Speech is the most natural form of human communication, and it carries the speaker's emotional state, which plays an important role in social interaction. Many instant messaging apps now let users exchange voice messages. As a result, a large amount of voice data is generated every day, creating a new and challenging opportunity for speech emotion recognition in real environments. In this study, we investigated emotion recognition from voice messages recorded in the wild using machine-learning algorithms. Unlike most research in this field, which uses databases of emotions evoked in lab environments, simulated by actors, or subjectively selected from radio or TV talks, we created an ecological speech dataset of voice messages from real WhatsApp conversations of 30 Spanish speakers. Four external evaluators labelled each message in terms of arousal and valence using the Self-Assessment Manikin (SAM) procedure. Pre-processing techniques were applied to the recordings, and a range of time-domain and frequency-domain features was extracted. Supervised machine-learning classifiers were trained with feature reduction and hyper-parameter tuning to recognize the affective state of each voice message. The best recognition rate was obtained with Support Vector Machines, which achieved 71.37% along the arousal dimension and 70.73% along the valence dimension. These results support the use of emotion recognition models in everyday communication apps, helping to understand human social behavior and people's interactions with devices in the real world.
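
As an illustration of what time-domain and frequency-domain features of a voice message can look like, the following minimal sketch computes three common descriptors on a single audio frame. The abstract does not list the features actually used, so the choice of zero-crossing rate, RMS energy, and spectral centroid is an assumption made here for illustration only, and all function names are hypothetical. The input is a synthetic 440 Hz tone rather than real speech.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Time domain: fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Time domain: root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def spectral_centroid(frame, sample_rate):
    """Frequency domain: magnitude-weighted mean frequency in Hz.

    Uses a naive O(n^2) DFT to stay dependency-free; an FFT library
    would be the practical choice for real recordings.
    """
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        magnitudes.append(abs(coeff))
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(magnitudes)) / total


def extract_features(frame, sample_rate):
    """Collect the illustrative (assumed) features into one dictionary."""
    return {
        "zero_crossing_rate": zero_crossing_rate(frame),
        "rms_energy": rms_energy(frame),
        "spectral_centroid_hz": spectral_centroid(frame, sample_rate),
    }


# Synthetic stand-in for one frame of a voice message: a 440 Hz sine tone.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]
features = extract_features(frame, sample_rate)
print(features)
```

In a full pipeline, feature vectors of this kind, computed per frame and summarized per message, would be the input to the feature reduction and classification stages described above.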