Facial image generation from textual descriptions is one of the most complicated tasks within the broader topic of Text-to-Image (TTI) synthesis. It is relevant to scientific research, cartoon and animation production, online marketing, game development, and other fields. Text-to-Face (TTF) synthesis has been studied extensively for the English language; however, existing work in Bangla is limited and not comprehensive. Because TTF synthesis remains largely unexplored for Bangla, this study investigates its possibilities at the intersection of Bangla Natural Language Processing and Computer Vision. This paper proposes Mukh-Oboyob, a novel system for generating highly detailed facial images from textual descriptions in the Bangla language. The system consists of two essential components: BanglaBERT, a pre-trained language model, and Stable Diffusion. BanglaBERT, a transformer-based pre-trained text encoder, transforms Bangla sentences into vector representations; Stable Diffusion then generates facial images conditioned on these text embeddings. The system is developed and trained on CelebA Bangla, a modified version of the CelebA dataset comprising face images, Bangla facial attributes, and Bangla text descriptions. A comprehensive qualitative and quantitative evaluation demonstrates the system's strong performance and detailed image outputs, with an FID score of 34.6828 and an LPIPS score of 0.4541.
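To make the two-stage pipeline concrete, the following is a minimal sketch, assuming the public csebuetnlp/banglabert checkpoint on Hugging Face and the Stable Diffusion v1.5 base model; the paper's actual fine-tuned weights, prompt length, and conditioning details are not given here, so this is an illustration under those assumptions rather than the authors' implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: encode a Bangla description into token-level embeddings.
# "csebuetnlp/banglabert" is the public BanglaBERT checkpoint; the paper's
# exact (possibly fine-tuned) encoder weights are an assumption here.
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
text_encoder = AutoModel.from_pretrained("csebuetnlp/banglabert").to(device)

# Hypothetical example prompt: "The woman has black hair and an attractive smile."
prompt = "মহিলাটির কালো চুল এবং আকর্ষণীয় হাসি আছে।"
inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,  # Stable Diffusion's usual token budget (assumed here)
    truncation=True,
    return_tensors="pt",
).to(device)
with torch.no_grad():
    prompt_embeds = text_encoder(**inputs).last_hidden_state  # shape (1, 77, 768)

# Stage 2: condition Stable Diffusion on the BanglaBERT embeddings.
# This base checkpoint is a stand-in for the model trained on CelebA Bangla.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
# guidance_scale=1.0 disables classifier-free guidance, so no negative
# (unconditional) embeddings are needed in this illustration.
image = pipe(prompt_embeds=prompt_embeds, guidance_scale=1.0).images[0]
image.save("generated_face.png")
```

This substitution is dimensionally feasible because BanglaBERT's base encoder and Stable Diffusion v1.5's CLIP text encoder both produce 768-dimensional token embeddings, so the UNet's cross-attention layers can consume the Bangla embeddings directly.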