In the standard bag-of-visual-words (BoVW) model, the burstiness problem of features and the ignorance of high-order information often weakens the discriminative power of image representation. To tackle them, we present a novel framework, named the Salient Superpixel Network, to learn the mid-level image representation. For reducing the impact of burstiness occurred in the background region, we use the salient regions instead of the whole image to extract local features, and a fast saliency detection algorithm based on the Gestalt grouping principle is proposed to generate image saliency maps. In order to introduce the high-order information, we propose a weighted second-order pooling (WSOP) method, which is capable of exploiting the high-order information and further alleviating the impact of burstiness in the foreground region. Then, we conduct experiments on six image classification benchmark datasets, and the results demonstrate the effectiveness of the proposed framework with either the handcrafted or the off-the-shelf CNN features.