The automated recognition of human emotions plays an important role in developing machines with emotional intelligence, and major research efforts are dedicated to developing emotion recognition methods. However, most affective computing models are based on images, audio, video, and brain signals. The literature lacks work that focuses on using only peripheral signals for emotion recognition (ER), an approach well suited to daily-life settings. Therefore, this paper presents a framework for ER in the arousal-valence space based on multi-modal peripheral signals. The data used in this work were collected with wearable devices during a debate between two people. The emotions of the participants were rated by multiple raters and converted into classes corresponding to the arousal-valence space. The use of a dynamic threshold for converting the ratings was investigated. An ER model is proposed that uses a Long Short-Term Memory (LSTM)-based architecture for classification. The model takes heart rate (HR), temperature (T), and electrodermal activity (EDA) signals as inputs carrying emotional cues. Additionally, a post-processing prediction mechanism is introduced to enhance recognition performance. The model is used to study the peripheral signals individually and in different combinations, as well as annotations derived from different ratings. It is also applied to classify valence and arousal both independently and in combination, under subject-dependent and subject-independent scenarios. The experimental results demonstrate the effectiveness of the proposed framework, which achieves classification accuracies of >96% and >93% for the independent and