The paper addresses the construction of contextual grammars for complex scene recognition. It proposes a multi-level grammar that covers two tasks: syntactic analysis of image sequences, and syntactic analysis of a scene that takes into account the multi-level movement of objects. It is shown that constructing a grammar that describes both the structural information of an image and the interaction between images requires an algorithm for inferring the grammar from a given set of dynamic images, which serves as the training sample. Training yields structural descriptions of the images and descriptions of their relations, which are then used for syntactic analysis of events with complex structure. The paper argues that contextual grammar rules are appropriate for dynamic scenes with multi-level movement and a complex structure that changes continuously over time, which gives rise to the concept of a multi-level contextual grammar. Some basic principles of formal grammar theory that underlie structural recognition methods are also described.