Image response generation is a crucial aspect of dialogue systems; however, existing systems focus predominantly on textual information, either neglecting image data or relying on large-scale language models, which leads to high training costs and slow generation. To address these challenges, this paper introduces the MDSAGM model, which comprises a Transformer-based text dialogue response generator, a text-to-image generation module, and a Selective Attention and Gating Mechanism (SAGM). The model's core lies in combining selective attention with a gating mechanism to enhance generalization and improve accuracy. The study further explores the respective contributions of image and text information in multimodal fusion. Compared with other large-scale models, the proposed model demonstrates greater computational efficiency, fewer parameters, and shorter response times, making it more lightweight. The dialogue model decouples the multimodal dialogue parameters from the overall model, enabling better parameter fitting through pre-training on abundant plain-text and text-image data. Extensive experiments show that the proposed method achieves promising results in both automatic and manual evaluations, generating information-rich text and image responses with higher accuracy while remaining lightweight.
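To make the fusion idea concrete, the following is a minimal PyTorch sketch of how a selective-attention-plus-gating fusion layer could look: text tokens attend over image region features, and a learned gate interpolates between the attended visual information and the original text stream. The class name `SAGMFusion`, the layer sizes, and the exact gating rule are illustrative assumptions, not the paper's published architecture.

```python
import torch
import torch.nn as nn

class SAGMFusion(nn.Module):
    """Illustrative sketch of a selective attention + gating fusion layer.
    All design details here are assumptions for exposition only."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Selective attention: text tokens attend over image regions.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gating: a learned per-dimension gate decides how much attended
        # image information to blend into the text representation.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, seq_len, d_model)  token embeddings
        # image: (batch, regions, d_model)  visual region features
        attended, _ = self.cross_attn(query=text, key=image, value=image)
        g = torch.sigmoid(self.gate(torch.cat([text, attended], dim=-1)))
        # The gate interpolates between image-conditioned and pure text features.
        return self.norm(g * attended + (1.0 - g) * text)

# Usage: fuse a batch of 4 dialogues (20 tokens) with 36 image regions each.
fusion = SAGMFusion(d_model=512)
out = fusion(torch.randn(4, 20, 512), torch.randn(4, 36, 512))
print(out.shape)  # torch.Size([4, 20, 512])
```

Because the gate is computed per token, such a layer can suppress visual input entirely for turns where the image is uninformative, which is one plausible way a fusion module could weigh the contributions of each modality.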