Designing effective camera trajectories in virtual 3D environments is a challenging task, even for experienced animators. Despite an elaborate film grammar, forged through years of experience, that enables the specification of camera motions through cinematographic properties (framing, shot sizes, angles, motions), there are endless possibilities for placing and moving cameras relative to characters, and dealing with these possibilities is part of the complexity of the problem. While numerous techniques have been proposed in the literature (optimization-based solving, encoding of empirical rules, learning from real examples, …), their results lack either variety or ease of control. In this paper, we propose a cinematographic camera diffusion model that uses a transformer-based architecture to handle temporality and exploits the stochasticity of diffusion models to generate diverse, high-quality trajectories conditioned on high-level textual descriptions. We extend this work by integrating keyframing constraints and the ability to blend naturally between motions through latent interpolation, thereby augmenting the designers' degree of control. We demonstrate the strengths of this text-to-camera-motion approach through qualitative and quantitative experiments, and gather feedback from professional artists. The code and data are available at https://github.com/jianghd1996/Camera-control.