Speech data typically contains task-irrelevant information embedded in its features. Specifically, phonetic information, speaker characteristics, emotional information, and noise are always mixed together and tend to impair one another for a given task. We propose a new type of auto-encoder for feature learning, called the contrastive auto-encoder. Unlike other auto-encoder variants, the contrastive auto-encoder can leverage class labels when constructing its representation layer. We achieve this by modeling two auto-encoders jointly and making the differences between them contribute to the total loss function. The transformation learned by the contrastive auto-encoder can be seen as a task-specific, invariant feature learner. Our experiments on TIMIT show that features extracted by the contrastive auto-encoder outperform the original acoustic features, features extracted by a deep auto-encoder, and features extracted by the model from which the contrastive auto-encoder originates.
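
To make the idea of two auto-encoders contributing a difference term to the total loss concrete, the following is a minimal NumPy sketch, not the exact formulation used in this work. It rests on several assumptions that the abstract does not state: the two auto-encoders share weights, each training pair `(x1, x2)` is drawn from the same class, the difference is measured as the squared Euclidean distance between the two hidden representations, and a hypothetical weight `lam` balances it against reconstruction error. The names `init_params`, `encode`, `decode`, and `contrastive_ae_loss` are illustrative only.

```python
import numpy as np


def init_params(rng, n_in, n_hidden):
    """Small random weights for a one-hidden-layer auto-encoder (illustrative)."""
    W = 0.1 * rng.normal(size=(n_in, n_hidden))      # encoder weights
    b = np.zeros(n_hidden)                           # encoder bias
    W_out = 0.1 * rng.normal(size=(n_hidden, n_in))  # decoder weights
    c = np.zeros(n_in)                               # decoder bias
    return W, b, W_out, c


def encode(x, W, b):
    """Map inputs to the representation layer."""
    return np.tanh(x @ W + b)


def decode(h, W_out, c):
    """Linear reconstruction of the input from the representation."""
    return h @ W_out + c


def contrastive_ae_loss(x1, x2, params, lam=1.0):
    """Reconstruction loss of both branches plus a representation-difference term.

    Assumption: x1[i] and x2[i] carry the same class label, and both branches
    share the same parameters. `lam` is a hypothetical trade-off weight.
    """
    W, b, W_out, c = params
    h1, h2 = encode(x1, W, b), encode(x2, W, b)
    r1, r2 = decode(h1, W_out, c), decode(h2, W_out, c)

    # Each branch should reconstruct its own input.
    recon = (np.mean(np.sum((r1 - x1) ** 2, axis=1))
             + np.mean(np.sum((r2 - x2) ** 2, axis=1)))

    # Same-class pairs are pulled together in representation space.
    contrast = np.mean(np.sum((h1 - h2) ** 2, axis=1))

    return recon + lam * contrast
```

Under these assumptions, minimizing the difference term encourages same-class frames (for example, the same phone spoken by different speakers) to share a representation, while the reconstruction terms keep that representation informative about the input.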