Emotion recognition from in-the-wild input signals with large variance is challenging because multiple sources of noise hinder a model's ability to approximate the ground truth. Numerous studies have addressed recognizing a person's affective expressions directly from their face, speech, or text, but little work has examined predicting viewers' emotions from the content they watch. Therefore, in this paper we propose a hybrid fusion model, called deep graph fusion, that leverages combined visual and audio representations to predict viewers' evoked expressions from videos. The proposed system consists of four steps. First, we extract visual and audio features for each 30-second segment using CNN-based pre-trained models to learn salient representations. Second, we reconstruct these features into graph structures and perform node embedding using graph convolutional networks. Third, we propose several fusion modules to combine the graph representations from the visual and audio branches. Finally, the fused features are passed through a sigmoid activation to estimate the evoked scores for all emotion classes. Moreover, to enhance the overall performance, we propose a semantic embedding loss that learns the semantic meaning of textual emotion labels. We evaluate the proposed method on the Evoked Expression from Videos (EEV) dataset on both the validation and test sets. The experimental results demonstrate that the proposed algorithm outperforms all baseline models.
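To make the four-step pipeline concrete, the following is a minimal PyTorch sketch of one possible realization, assuming pre-extracted per-segment visual and audio features and a normalized adjacency matrix over segments; the layer sizes, class count, and the concatenation-based fusion shown here are illustrative assumptions rather than the exact architecture, and the semantic embedding loss is omitted.

```python
# Hypothetical sketch of the described deep graph fusion pipeline (not the
# authors' implementation): per-segment visual/audio features are treated as
# graph nodes, embedded with a simple GCN layer, fused, and mapped to
# per-class evoked-expression scores with a sigmoid.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: ReLU(A_hat @ X @ W) with a normalized adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features, adj: (N, N) normalized adjacency
        return torch.relu(self.linear(adj @ x))

class DeepGraphFusion(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=128, hid_dim=256, num_classes=15):
        super().__init__()
        self.vis_gcn = SimpleGCNLayer(vis_dim, hid_dim)
        self.aud_gcn = SimpleGCNLayer(aud_dim, hid_dim)
        # Fusion is shown here as concatenation + projection; the paper
        # proposes several fusion module variants.
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, vis_feat, aud_feat, vis_adj, aud_adj):
        v = self.vis_gcn(vis_feat, vis_adj)   # visual node embeddings
        a = self.aud_gcn(aud_feat, aud_adj)   # audio node embeddings
        h = torch.relu(self.fuse(torch.cat([v, a], dim=-1)))
        # Sigmoid yields an evoked-expression score per emotion class.
        return torch.sigmoid(self.classifier(h))

# Example with random stand-in features for N = 10 video segments.
N = 10
vis = torch.randn(N, 2048)          # e.g. CNN visual feature per segment
aud = torch.randn(N, 128)           # e.g. audio embedding per segment
adj = torch.eye(N)                  # placeholder adjacency (self-loops only)
model = DeepGraphFusion()
scores = model(vis, aud, adj, adj)  # (N, num_classes) scores in [0, 1]
```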