A VAE requires a standard Gaussian prior on the latent space. Since all codes are pushed toward the same prior, it often suffers from so-called "posterior collapse". To avoid this, this paper introduces a class-specific distribution for the latent code. Unlike cVAE, however, we present a method for disentangling the latent space of a single input into label-relevant and label-irrelevant dimensions, $z_s$ and $z_u$. We apply two separate encoders to map the input into $z_s$ and $z_u$ respectively, and then feed the concatenated code to the decoder to reconstruct the input. The label-irrelevant code $z_u$ represents the characteristics common to all inputs; hence it is constrained by the standard Gaussian, and its encoder is trained by amortized variational inference, as in a VAE. In contrast, $z_s$ is assumed to follow a Gaussian mixture distribution in which each component corresponds to a particular class. The parameters of the Gaussian components in the $z_s$ encoder are optimized under label supervision in a global stochastic way. In theory, we show that our method is equivalent to adding a KL divergence term on the joint distribution of $z_s$ and the class label $c$, and that it directly increases the mutual information between $z_s$ and $c$. Our model can also be extended to a GAN by adding a discriminator in the pixel domain, so that it produces high-quality and diverse images.
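The two latent constraints described above can be illustrated with closed-form KL terms between diagonal Gaussians. This is only a minimal sketch under stated assumptions, not the paper's objective: the posteriors are taken to be diagonal Gaussians, each class component is given unit variance, and the class means are passed in as fixed values rather than learned under label supervision. The names `diag_gaussian_kl`, `latent_regularizers`, and `class_means` are hypothetical.

```python
import math


def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl


def latent_regularizers(mu_u, logvar_u, mu_s, logvar_s, class_means, label):
    """Return (KL for z_u, KL for z_s) for a single input.

    z_u is pulled toward the standard Gaussian N(0, I).
    z_s is pulled toward the component of its own class; the unit variance
    of that component is an assumption made for this sketch.
    """
    zeros_u = [0.0] * len(mu_u)
    kl_u = diag_gaussian_kl(mu_u, logvar_u, zeros_u, zeros_u)

    mu_c = class_means[label]
    kl_s = diag_gaussian_kl(mu_s, logvar_s, mu_c, [0.0] * len(mu_c))
    return kl_u, kl_s


# Hypothetical two-class example with a 2-D z_s and a 3-D z_u.
class_means = {0: [2.0, 0.0], 1: [-2.0, 0.0]}
kl_u, kl_s_own = latent_regularizers(
    [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 0.0], [0.0, 0.0], class_means, 0
)
_, kl_s_other = latent_regularizers(
    [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 0.0], [0.0, 0.0], class_means, 1
)
```

In this toy setting a code sitting exactly on its own class mean incurs no penalty, while the same code measured against the other class's component is penalized, which is the intuition behind tying $z_s$ to the label.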
arXiv:1812.09502v4 [cs.CV] 15 Mar 2019

| Encoder | Decoder | Discriminator |
| --- | --- | --- |
| | concat | |
| 5 × 5 conv, 64, stride 2, batchnorm, relu | fc, 1024, batchnorm, relu | 5 × 5 conv, 32, stride 1, lrelu |
| 3 × 3 conv, 128, stride 2, batchnorm, relu | 5 × 5 conv, 256, stride 2, batchnorm, relu | 5 × 5 conv, 128, stride 2, lrelu |
| 3 × 3 conv, 256, stride 2, batchnorm, relu | 5 × 5 conv, 256, stride 1, batchnorm, relu | 5 × 5 conv, 256, stride 2, lrelu |
| fc, 1024, batchnorm, relu | 5 × 5 conv, 128, stride 2, batchnorm, relu | 5 × 5 conv, 256, stride 2, lrelu |
| fc, 100 (for $z_s$) / 200 (for $z_u$) | 5 × 5 conv, 64, stride 2, batchnorm, relu | fc, 512, lrelu |
| | 5 × 5 conv, 32, stride 2, batchnorm, relu | fc, 1 |
| | 5 × 5 conv, 3, stride 1, tanh | |

Table 3. The network structure for CIFAR-10.

| Discriminator for FaceScrub |
| --- |
| input $x \in \mathbb{R}^{64 \times 64 \times 3}$ |
| 3 × 3 conv, 64, stride 2, lrelu |
| 3 × 3 conv, 128, stride 2, lrelu |
| 3 × 3 conv, 256, stride 1, lrelu |
| 3 × 3 conv, 256, stride 2, lrelu |
| 3 × 3 conv, 512, stride 1, lrelu |
| 3 × 3 conv, 512, stride 2, lrelu |
| 3 × 3 conv, 512, stride 2, lrelu |
| global average pooling |
| fc, 1024, lrelu |
| fc, 1 |

Table 4. The network structure of discriminator for FaceScrub.
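To make the FaceScrub discriminator listing easier to follow, the feature-map sizes can be traced layer by layer. The sketch below is an illustration only: the tables do not state the padding scheme, so "same" padding (output size equal to the input size divided by the stride, rounded up) is an assumption, and `FACESCRUB_DISC_CONVS` and `trace_shapes` are hypothetical names.

```python
import math

# (kernel, out_channels, stride) for the conv layers of the FaceScrub discriminator
FACESCRUB_DISC_CONVS = [
    (3, 64, 2),
    (3, 128, 2),
    (3, 256, 1),
    (3, 256, 2),
    (3, 512, 1),
    (3, 512, 2),
    (3, 512, 2),
]


def trace_shapes(size, channels, convs):
    """Return the (height, width, channels) shape after each conv layer.

    Assumes square inputs and "same" padding, which the source does not
    specify; with that padding the kernel size does not affect the output size.
    """
    shapes = [(size, size, channels)]
    for _kernel, out_channels, stride in convs:
        size = math.ceil(size / stride)
        shapes.append((size, size, out_channels))
    return shapes


shapes = trace_shapes(64, 3, FACESCRUB_DISC_CONVS)
for shape in shapes:
    print(shape)
```

Under this padding assumption the last conv layer leaves a 2 × 2 × 512 feature map, which global average pooling reduces to a 512-dimensional vector before the two fully connected layers.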