Several recent studies have demonstrated that deep networks can generate realistic‐looking textures and stylised images from a single texture example. However, existing approaches suffer from drawbacks. Generative adversarial networks are in general difficult to train, and the multiple feature variations encoded in their latent representation require a priori information to generate images with specific features. Auto‐encoders are prone to generating blurry output, largely because they are unable to parameterise complex distributions. The authors present a novel texture generative model architecture that extends the variational auto‐encoder approach and gradually increases the accuracy of details in the reconstructed images. Thanks to the proposed architecture, the model learns a higher level of detail as a result of the partial disentanglement of latent variables, and it is capable of synthesising complex real‐world textures. The model consists of multiple separate latent layers, each responsible for learning a successive level of texture detail. Training the latent representations separately increases the stability of the learning process and provides partial disentanglement of latent variables. Experiments with the proposed architecture demonstrate the potential of variational auto‐encoders in the domain of texture synthesis and yield sharper reconstructed as well as synthesised texture images.
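The multi‐level idea described above (separate latent layers, each modelling a successive level of detail on top of a coarser reconstruction) can be sketched as a toy forward pass in NumPy. All dimensions, weight matrices, and the residual‐refinement formulation below are illustrative assumptions for a minimal sketch, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    # Standard VAE reparameterisation trick: z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Hypothetical toy sizes: a flattened 8x8 texture patch, one coarse
# latent layer (8 dims) and one finer latent layer (16 dims).
x = rng.standard_normal(64)                   # input texture patch
W_enc1 = rng.standard_normal((8, 64)) * 0.1   # coarse encoder weights
W_dec1 = rng.standard_normal((64, 8)) * 0.1   # coarse decoder weights
W_enc2 = rng.standard_normal((16, 64)) * 0.1  # fine encoder weights
W_dec2 = rng.standard_normal((64, 16)) * 0.1  # refinement decoder weights

# Level 1: the coarse latent layer reconstructs the rough structure.
h1 = W_enc1 @ x
mu1, logvar1 = h1, np.full_like(h1, -2.0)
z1 = reparameterize(mu1, logvar1, rng)
recon_coarse = W_dec1 @ z1

# Level 2: a separate latent layer models the residual detail the
# coarse layer missed.  In a staged‐training reading, the level‐1
# weights would be frozen while this layer is trained.
residual = x - recon_coarse
h2 = W_enc2 @ residual
mu2, logvar2 = h2, np.full_like(h2, -2.0)
z2 = reparameterize(mu2, logvar2, rng)
recon_fine = recon_coarse + W_dec2 @ z2

# Each latent level contributes its own KL regulariser to the loss.
total_kl = kl_divergence(mu1, logvar1) + kl_divergence(mu2, logvar2)
```

Training each latent level against the residual of the previous one is one plausible way to obtain the gradual refinement and partial disentanglement the abstract describes: the coarse code cannot absorb fine detail, so the two levels are pushed to encode different factors.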