Affective image analysis aims to recognize the sentiment that an image conveys. The central challenge is to learn a discriminative representation that bridges the affective gap between low-level visual features and high-level emotions. Most existing studies address this gap either by carefully designing deep models that learn a global representation in a single pass, or by extracting features from different levels of the model. These approaches overlook the fact that both the local regions of an image and the relationships among them influence emotional representation learning. This paper proposes an affective image analysis method based on the aesthetic fusion hybrid attention network (AFHA). A modular hybrid attention block is designed to extract emotion features and model long-range dependencies within an image. Stacking these blocks in a ResNet-style layout yields an affective representation backbone. Furthermore, because image emotion is inseparable from aesthetics, a modified ResNet is employed to extract aesthetic features. Finally, a fusion strategy combines the two so that an image's emotion is predicted jointly with the aesthetics it conveys. Experiments demonstrate the close relationship between emotion and aesthetics and show that the proposed method is competitive with existing methods on the image sentiment analysis dataset.
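The two-branch idea described above can be illustrated with a minimal, dependency-free sketch. It is not the AFHA implementation: the abstract does not specify the internals of the hybrid attention block or the fusion strategy, so the sketch assumes plain scaled dot-product self-attention over region features (with identity projections) to stand in for long-range dependency modeling, and simple concatenation to stand in for fusion. All function names and the toy feature values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(regions):
    """Scaled dot-product self-attention over region feature vectors.

    Assumption: queries, keys, and values are the raw region features
    (identity projections); a real block would learn these projections.
    """
    d = len(regions[0])
    attended = []
    for q in regions:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in regions]
        weights = softmax(scores)
        # Each output is a weighted mix of all regions, so distant regions interact.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, regions)) for j in range(d)]
        )
    return attended


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def fuse(emotion_vec, aesthetic_vec):
    """Assumed fusion strategy: concatenate the two feature vectors."""
    return list(emotion_vec) + list(aesthetic_vec)


def classify(features, weights, biases):
    """Linear layer followed by softmax, giving per-class probabilities."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)
```

As a usage example, three toy region features such as `[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]` can be passed through `self_attention` and `mean_pool` to obtain an emotion vector, fused with a one-dimensional aesthetic vector such as `[0.5]`, and classified with a 2x3 weight matrix to obtain two class probabilities. In a real system both branches would be learned convolutional or attention networks rather than fixed toy vectors.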