We present a variational autoencoder (VAE) learning framework with introspective training for conditional image synthesis, and explore a conditional capsule encoder that injects class labels through class-wise capsule masking. Our model consists only of an encoder (E), a generator (G), and a classifier (C), where E and G can be optimized adversarially, while C boosts conditional generation, improves authenticity, and provides generation measures for E and G. A discriminator is unnecessary in our framework, and its absence makes the model more concise, with fewer artifacts and less mode collapse. To compensate for the blurriness typical of VAE-like models, feature matching via C is introduced into the loss functions to provide more reasonable measures of the discrepancy between real and synthesized images. Moreover, given the key role of the encoder in autoencoders and the interesting properties of the capsule structure, a conditional capsule encoder is preliminarily explored within the synthesis model: class labels condition the encoding by masking the high-level capsules of all other categories, and a capsule loss on the encoder is added to facilitate conditional synthesis. Experiments on the MNIST and Fashion-MNIST data sets show that our model achieves realistic conditional synthesis with better diversity and fewer artifacts, and the conditional capsule encoder also reveals interesting synthesis effects.
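The class-wise capsule masking described above can be sketched minimally as follows. This is an illustrative NumPy sketch, not the authors' implementation: it assumes the encoder emits one high-level capsule per class with shape (batch, num_classes, capsule_dim), and the hypothetical helper `mask_capsules` zeroes every capsule except the one for the ground-truth label.

```python
import numpy as np

def mask_capsules(capsules, labels, num_classes=10):
    """Keep only the capsule of the labelled class, zeroing all others.

    capsules: (batch, num_classes, capsule_dim) high-level capsule outputs
    labels:   (batch,) integer class labels
    """
    one_hot = np.eye(num_classes)[labels]      # (batch, num_classes)
    # Broadcast the one-hot mask over the capsule dimension.
    return capsules * one_hot[:, :, None]

# Toy example: batch of 2 images, 10 classes, 16-dim capsules.
caps = np.random.rand(2, 10, 16)
masked = mask_capsules(caps, np.array([3, 7]))
```

Only the masked capsule vectors are then passed to the generator, so the latent code is tied to a single class at synthesis time.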