Video frame interpolation (VFI) synthesises intermediate frames between adjacent original frames to increase the temporal resolution of a video. However, existing methods usually rely on heavy model architectures with large numbers of parameters. The authors introduce an efficient VFI network built from multiple lightweight convolutional units and a local three-scale encoding (LTSE) structure. First, the authors introduce an LTSE structure with two-level attention cascades, designed to efficiently capture details and contextual information across diverse image scales. Second, the authors combine recurrent convolutional layers (RCLs) with residual operations to design a recurrent residual convolutional unit that optimises the LTSE structure. Additionally, a lightweight variant, the separable recurrent residual convolutional unit, is introduced to reduce the number of model parameters. Finally, the three-scale decoding features obtained from the decoder are warped into a set of three-scale pre-warped maps, which are fused by the synthesis network to generate high-quality interpolated frames. The experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.
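To make the building blocks concrete, the sketch below shows one plausible PyTorch realisation of a recurrent residual convolutional unit and its separable (lightweight) variant. The abstract does not specify the exact layer ordering, recurrence depth, or channel settings, so the choices here (a recurrence depth of t=2, 3x3 kernels, batch normalisation, and depthwise-plus-pointwise convolutions for the separable case) are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch, assuming a shared-weight recurrent convolution (RCL) wrapped
# in a residual connection; hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn


class RecurrentConv(nn.Module):
    """Recurrent convolutional layer: the same convolution is applied t times,
    re-injecting the original input at every recurrence step."""

    def __init__(self, channels: int, t: int = 2):
        super().__init__()
        self.t = t
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        for _ in range(self.t - 1):
            out = self.conv(x + out)  # recurrence with shared weights
        return out


class SeparableRecurrentConv(nn.Module):
    """Same recurrence, but each convolution is depthwise-separable
    (depthwise 3x3 followed by pointwise 1x1) to cut parameters."""

    def __init__(self, channels: int, t: int = 2):
        super().__init__()
        self.t = t
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        for _ in range(self.t - 1):
            out = self.conv(x + out)
        return out


class RecurrentResidualUnit(nn.Module):
    """Two stacked recurrent convolutions wrapped in a residual connection.
    Set separable=True for the lightweight (separable) variant."""

    def __init__(self, channels: int, t: int = 2, separable: bool = False):
        super().__init__()
        block = SeparableRecurrentConv if separable else RecurrentConv
        self.body = nn.Sequential(block(channels, t), block(channels, t))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # residual operation


if __name__ == "__main__":
    feats = torch.randn(1, 64, 128, 128)
    unit = RecurrentResidualUnit(channels=64, t=2, separable=True)
    print(unit(feats).shape)  # torch.Size([1, 64, 128, 128])
```

Under these assumptions, the separable variant replaces each full 3x3 convolution with a depthwise 3x3 plus a pointwise 1x1 convolution, which is the standard way such units trade a small amount of representational capacity for a substantial reduction in parameter count.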