This paper is concerned with the efficiency of sparse coding based audio-word feature extraction systems. In particular, we define the concepts of early and late temporal pooling, add them to the classic sparse coding based audio-word feature extraction pipeline, and test them on the genre tags subset of the CAL10k data set. We define temporal pooling as any function that transforms an input time series representation into a more temporally compact representation. Under this definition, we examine two temporal pooling functions for improving the efficiency of feature extraction: early texture window pooling and multiple frame representation. Early texture window pooling tremendously boosts efficiency at the cost of retrieval accuracy, while multiple frame representation slightly improves both feature extraction efficiency and retrieval accuracy. Overall, our best feature extraction setup achieves a mean average precision of 0.202 on the genre tags subset of the CAL10k data set.
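
To make the definition of temporal pooling concrete, the following Python sketch shows two functions that map a sequence of frame-level feature vectors to a shorter sequence. It is an illustration only, not the implementation evaluated in this paper. It assumes that texture window pooling averages the frames inside each window and that a multiple frame representation concatenates consecutive frames into one longer vector. The function names `texture_window_pool` and `stack_frames`, and the parameters `window`, `hop`, and `group`, are hypothetical.

```python
from typing import List

Frame = List[float]


def texture_window_pool(frames: List[Frame], window: int, hop: int) -> List[Frame]:
    """Average each run of `window` frames, advancing by `hop` frames.

    Assumption: the mean is used as the pooling statistic.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    pooled = []
    # Trailing frames that do not fill a whole window are dropped.
    for start in range(0, len(frames) - window + 1, hop):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / window for d in range(dim)])
    return pooled


def stack_frames(frames: List[Frame], group: int) -> List[Frame]:
    """Concatenate each run of `group` consecutive frames into one vector.

    Assumption: groups do not overlap, so the output is `group` times shorter.
    """
    if group <= 0:
        raise ValueError("group must be positive")
    stacked = []
    # Trailing frames that do not fill a whole group are dropped.
    for start in range(0, len(frames) - group + 1, group):
        merged: Frame = []
        for frame in frames[start:start + group]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


# Four 2-dimensional frames standing in for a short time series.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pooled = texture_window_pool(frames, window=2, hop=2)
stacked = stack_frames(frames, group=2)
```

Both outputs contain two vectors instead of four, so every later stage of the pipeline processes half as many items. Averaging keeps the original dimensionality but discards detail within each window, whereas concatenation keeps that detail at the cost of a longer vector.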