Emotion plays an important role in our daily lives: an individual's emotional state can affect the performance of a company, the harmony of a family, and the wellness and growth (physical, mental, and spiritual) of a child, among many other impacts. Existing work on emotion detection treats facial expressions and voice differently: a facial expression is captured externally on the face, whereas the voice is produced internally by air passing through the vocal folds, so models built on the two signals may deviate considerably from each other. This paper studies and analyses a person's emotion through two separate models, one for facial expression and one for voice. The proposed algorithm uses Convolutional Neural Networks (CNNs) with 2-dimensional convolutional layers for facial expression and 1-dimensional convolutional layers for voice. Features are extracted from video frames via face detection and from the audio via Mel-spectrogram extraction. The network layers are fine-tuned to improve the performance of the CNN models. The trained CNN models recognize emotions from input videos, which may contain single or multiple emotions, from both the facial-expression and voice perspectives. The test videos are free of background music and environmental noise and contain only a single person's voice. The proposed algorithm achieved an accuracy of 62.9% through facial expression and 82.3% through voice.
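To make the dual-model setup concrete, the sketch below pairs a face-detection front end with a 2D CNN and a Mel-spectrogram front end with a 1D CNN. It is a minimal illustration only, assuming OpenCV, librosa, and Keras/TensorFlow as the toolchain; the 48x48 face crop, the 7-class label set, the time-averaged Mel vector, and all layer widths are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch of the two-branch emotion pipeline described above.
# Toolchain (OpenCV, librosa, Keras) and all hyperparameters are assumed,
# not taken from the paper.
import cv2
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

NUM_EMOTIONS = 7  # assumed label-set size (e.g., FER-style categories)

def extract_face(frame_bgr):
    """Detect the largest face in a video frame; return a 48x48 grayscale crop."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detection
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
    return face[..., np.newaxis] / 255.0  # shape (48, 48, 1)

def extract_mel_vector(wav_path, sr=16000, n_mels=128):
    """Compute a log-Mel spectrogram and average it over time into one vector."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.mean(axis=1)[..., np.newaxis]  # shape (n_mels, 1)

# 2D-convolutional branch for facial expression.
face_model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

# 1D-convolutional branch for voice, convolving along the Mel-frequency axis.
voice_model = tf.keras.Sequential([
    layers.Conv1D(64, 5, activation="relu", input_shape=(128, 1)),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Flatten(),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

for model in (face_model, voice_model):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```

Keeping the two branches as independent models mirrors the decision above to evaluate facial expression and voice separately rather than fusing them; over a full clip, per-frame face predictions and the audio prediction would each be aggregated within their own branch.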