Text-driven image stylization aims to render content images in styles specified by text. Recent studies have shown the potential of diffusion models for producing rich stylizations. However, existing approaches do not control the degree of stylization effectively, which makes it difficult to balance style and content in the generated images. In this paper, we propose a Controllable Text-Driven Image Stylization (ConIS) framework based on the diffusion model. The framework adds two modules to a pre-trained text-to-image model. The first is an Unconditional Null-text Inversion (UNTI) module, which optimizes the null-text embedding to reduce the discrepancy between inversion and sampling in the diffusion model. Given a content image, this module can reconstruct it without semantic guidance. The second is a Null-text Dilution (NTD) module, which parameterizes the semantic intensity of the textual condition and thereby indirectly controls the degree of stylization through a style degree factor. Finally, we replace the attention maps used during sampling with those from the UNTI module to preserve the structure of the content image. Experiments show that the proposed method enables fine-grained control over the degree of stylization without retraining or fine-tuning the network. Both qualitative and quantitative results indicate that the ConIS framework outperforms state-of-the-art methods in balancing artistic detail and content structure.
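As a rough illustration of how a style degree factor could modulate the semantic intensity of a textual condition, the sketch below linearly interpolates between a null-text embedding and a style-text embedding. The interpolation rule, the function name `dilute_condition`, and the embedding shape are assumptions made for this example. They are not taken from the ConIS formulation, which may parameterize the dilution differently.

```python
import numpy as np


def dilute_condition(text_emb, null_emb, style_degree):
    """Blend a text embedding toward a null-text embedding.

    Hypothetical helper: linear interpolation is an assumed stand-in for
    the parameterization of semantic intensity, not the paper's method.
    style_degree = 0 returns the null-text embedding (no style semantics);
    style_degree = 1 returns the full text embedding.
    """
    if not 0.0 <= style_degree <= 1.0:
        raise ValueError("style_degree must lie in [0, 1]")
    return style_degree * text_emb + (1.0 - style_degree) * null_emb


# Toy embeddings; the (77, 768) shape is only illustrative.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(77, 768))
null_emb = rng.normal(size=(77, 768))

for s in (0.0, 0.5, 1.0):
    cond = dilute_condition(text_emb, null_emb, s)
    # Distance from the null-text embedding grows with the style degree.
    print(s, float(np.linalg.norm(cond - null_emb)))
```

Under this assumed rule, the blended condition moves continuously from the null-text embedding to the full style prompt, which is one simple way to expose a single scalar control without retraining the network.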