Music genre recognition (MGR) plays a fundamental role in music indexing and retrieval. Unlike images, music genres are defined by immediate characteristics that are highly diverse and span different levels of abstraction. However, most representation learning methods for MGR focus on global features and make decisions from features at a single level. To address these limitations, we integrate a convolutional neural network (CNN) with NetVLAD and self-attention to capture local information across levels and learn its long-term dependencies. A meta classifier then makes the final MGR decision by learning from the aggregated high-level features produced by the different local feature coding networks. Experimental results show that the proposed approach achieves higher accuracy than other state-of-the-art models on the GTZAN, ISMIR2004, and Extended Ballroom datasets.
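To make the two local feature coding steps concrete, the following is a minimal NumPy sketch of NetVLAD-style aggregation and of scaled dot-product self-attention. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names, the shapes, the identity query/key/value projections, and the random parameters are all hypothetical, and in practice the cluster centers and assignment weights would be learned jointly with the CNN.

```python
import numpy as np


def netvlad_aggregate(features, centers, assign_w, assign_b, eps=1e-12):
    """NetVLAD-style pooling of N local descriptors into one vector.

    features: (N, D) local descriptors, e.g. CNN activations over time.
    centers:  (K, D) cluster centers (learned in a real model).
    assign_w: (D, K) and assign_b: (K,) soft-assignment parameters.
    Returns a flattened, L2-normalized vector of length K * D.
    """
    features = np.asarray(features, dtype=float)
    # Soft-assign every descriptor to every cluster with a stable softmax.
    logits = features @ assign_w + assign_b              # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    soft = np.exp(logits)
    soft = soft / soft.sum(axis=1, keepdims=True)        # rows sum to 1
    # Accumulate assignment-weighted residuals to each center.
    residuals = features[:, None, :] - centers[None, :, :]   # (N, K, D)
    vlad = (soft[:, :, None] * residuals).sum(axis=0)        # (K, D)
    # Intra-normalize per cluster, then L2-normalize the whole vector.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + eps)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + eps)


def self_attention(x):
    """Scaled dot-product self-attention over a (T, D) sequence.

    Assumption: identity query/key/value projections, to keep the sketch short.
    """
    x = np.asarray(x, dtype=float)
    scores = x @ x.T / np.sqrt(x.shape[1])               # (T, T)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x                                   # (T, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((20, 8))      # 20 frames, 8-dim descriptors
    centers = rng.standard_normal((4, 8))     # 4 hypothetical clusters
    w = rng.standard_normal((8, 4))
    b = np.zeros(4)
    attended = self_attention(feats)          # relate distant frames
    pooled = netvlad_aggregate(attended, centers, w, b)
    print(pooled.shape)
```

In this sketch the attention step is applied before pooling; the order and the way features from several CNN levels are combined before the meta classifier are design choices that the sketch does not attempt to reproduce.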