In woodwind instruments such as the flute, producing a tone higher in pitch than the standard tone by increasing the blowing pressure is called overblowing, and it allows several distinct fingerings for the same note. This article presents a method that uses unsupervised feature learning to learn acoustic features better suited to detecting the fingering from a flute sound than conventional features such as mel-frequency cepstral coefficients (MFCCs). To do so, we first extract a spectrogram from the audio and convert it to the mel scale. We then concatenate four consecutive mel-spectrogram frames to capture short-term temporal information and use the result as the input to the sparse filtering algorithm. The learned features are then max-pooled, yielding a more robust final feature vector for the classifier. We demonstrate the advantages of the proposed method in two ways. First, we visualize and analyze the differences in the learned features between tones produced by standard and overblown fingerings. Second, we perform a quantitative evaluation through classification tasks on six selected pitches, each with up to five different fingerings covering a variety of octave-related and non-octave-related fingerings. The results confirm that the features learned with the proposed method significantly outperform conventional MFCCs and the residual noise spectrum under every experimental condition in the classification tasks.
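As a rough illustration of the pipeline summarized above, the sketch below computes a mel spectrogram with librosa, stacks four consecutive frames (the context length stated in the abstract), learns features with a minimal NumPy/SciPy port of the standard sparse filtering objective, and max-pools the resulting activations. The sample rate, number of mel bands, number of learned features, optimizer settings, and pooling width are illustrative assumptions, not values taken from the paper, and for simplicity the sparse filtering weights are trained on the same recording they encode rather than on a training corpus.

```python
# Sketch of the described pipeline: mel spectrogram -> 4-frame stacking ->
# sparse filtering -> max pooling. All numeric settings below are assumptions
# chosen for illustration only.
import numpy as np
import librosa
from scipy.optimize import minimize

EPS = 1e-8


def stack_frames(mel, context=4):
    """Concatenate `context` consecutive mel frames into one column each."""
    d, t = mel.shape
    cols = [mel[:, i:i + context].T.reshape(-1) for i in range(t - context + 1)]
    return np.stack(cols, axis=1)                  # (d * context, t - context + 1)


def _l2row(x):
    """Row-wise L2 normalization; returns normalized rows and their norms."""
    n = np.sqrt((x ** 2).sum(axis=1, keepdims=True) + EPS)
    return x / n, n


def _l2row_grad(x, y, n, d):
    """Backpropagate gradient d through the row-wise L2 normalization."""
    return d / n - y * ((d * x).sum(axis=1, keepdims=True) / (n ** 2))


def sparse_filtering(data, n_features=64, maxiter=200, seed=0):
    """Minimal sparse filtering on `data` of shape (input_dim, n_examples)."""
    rng = np.random.default_rng(seed)
    w0 = rng.standard_normal(n_features * data.shape[0])

    def objective(w):
        W = w.reshape(n_features, data.shape[0])
        F = W @ data
        Fs = np.sqrt(F ** 2 + EPS)                 # soft absolute value
        NFs, l2f = _l2row(Fs)                      # normalize each learned feature
        Fhat, l2n = _l2row(NFs.T)                  # normalize each example
        obj = Fhat.sum()                           # L1 sparsity objective
        dW = _l2row_grad(NFs.T, Fhat, l2n, np.ones_like(Fhat))
        dW = _l2row_grad(Fs, NFs, l2f, dW.T)
        dW = (dW * (F / Fs)) @ data.T
        return obj, dW.ravel()

    res = minimize(objective, w0, jac=True, method="L-BFGS-B",
                   options={"maxiter": maxiter})
    return res.x.reshape(n_features, data.shape[0])


def extract_features(path, context=4, pool=8):
    """End-to-end sketch: audio file -> max-pooled learned features."""
    y, sr = librosa.load(path, sr=22050)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
    X = stack_frames(mel, context)                 # front end for sparse filtering
    W = sparse_filtering(X)
    F = np.sqrt((W @ X) ** 2 + EPS)                # learned feature activations
    # Max-pool over `pool` consecutive frames for added robustness.
    t = F.shape[1] - F.shape[1] % pool
    pooled = F[:, :t].reshape(F.shape[0], -1, pool).max(axis=2)
    return pooled                                  # columns feed the classifier
```

The gradient code mirrors the widely used reference formulation of sparse filtering, in which the soft-absolute activations are normalized first across examples per feature and then across features per example before summing; only the objective and its analytic gradient are needed, so an off-the-shelf L-BFGS optimizer suffices.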