We propose a novel method that can learn easy-to-interpret latent representations in realworld image datasets using a VAE-based model by splitting an image into several disjoint regions. Our method performs object-wise disentanglement by exploiting image segmentation and alpha compositing. With remarkable results obtained by unsupervised disentanglement methods for toy datasets, recent studies have tackled challenging disentanglement for real-world image datasets. However, these methods involve deviations from the standard VAE architecture, which has favorable disentanglement properties. Thus, for disentanglement in images of real-world image datasets with preservation of the VAE backbone, we designed an encoder and a decoder that embed an image into disjoint sets of latent variables corresponding to objects. The encoder includes a pre-trained image segmentation network, which allows our model to focus only on representation learning while adopting image segmentation as an inductive bias. Evaluations using real-world image datasets, CelebA and Stanford Cars, showed that our method achieves improved disentanglement and transferability.INDEX TERMS Alpha blend, disentanglement, image segmentation, real-world image, representation learning.