Figure 1: In this work, we explore the new task of text-driven stylized image generation, i.e., directly generating stylized images from a style image and a text prompt that describes the content. A simple solution is to chain a text-to-image model (text ⇒ image) with a style transfer network (content image ⇒ stylized image) in a two-stage manner. In contrast, our ControlStyle unifies both stages into a single diffusion process, yielding high-fidelity stylized images with better visual quality than the two-stage pipeline.
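To make the contrast between the two-stage baseline and the unified diffusion process concrete, the following is a minimal Python sketch. It assumes a Hugging Face diffusers text-to-image pipeline for the first stage; the `style_transfer` network, the `stylized_pipe` interface, and the model checkpoint name are hypothetical placeholders for illustration, not the paper's released code.

```python
# Minimal sketch: two-stage baseline vs. a unified stylized diffusion process.
# Assumes the Hugging Face `diffusers` API; `style_transfer` and `stylized_pipe`
# are hypothetical stand-ins, not actual ControlStyle code.

import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# ---- Two-stage baseline: text -> content image -> stylized image ----
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # hypothetical choice of base model
).to(device)

def two_stage(prompt, style_image, style_transfer):
    """Stage 1: synthesize a content image from the text prompt.
    Stage 2: restyle it with an off-the-shelf style transfer network."""
    content = t2i(prompt).images[0]
    return style_transfer(content, style_image)  # hypothetical network

# ---- Unified approach: one diffusion process conditioned on both inputs ----
def unified(prompt, style_image, stylized_pipe):
    """`stylized_pipe` stands in for a diffusion model whose denoising is
    jointly conditioned on the text prompt and the style image, so a single
    sampling pass produces the stylized result directly."""
    return stylized_pipe(prompt=prompt, style=style_image).images[0]
```

The key difference the sketch highlights is that the baseline produces an intermediate content image and hands it to a separate network, whereas the unified formulation conditions a single denoising process on both the prompt and the style image.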