Using the ELAN multimodal discourse analysis software, and taking visual grammar and multimodal discourse analysis as its theoretical basis, this paper constructs a multimodal Russian translation model on top of a machine translation model. To address the loss of semantics caused by insufficient source-side input in real-time translation, the model uses images as an auxiliary modality. The real-time Russian translation model is built with the wait-k strategy and multimodal self-attention. The model is trained on the Multi30k training set, and its generalization ability and translation quality are then evaluated on the test set. The results show that applying multimodal discourse analysis to Russian translation improves the three translation evaluation metrics BLEU, METEOR, and TER by 1.3, 1.0, and 1.4 percentage points, respectively, and effectively reduces hallucinated ("phantom") translations.
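
To illustrate the wait-k strategy mentioned above, the following minimal sketch computes the usual wait-k read/write schedule: the decoder first reads k source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted. The function names `wait_k_schedule` and `wait_k_actions` are hypothetical and chosen only for illustration. The sketch covers only the text-side schedule; it does not model the image modality or the multimodal self-attention used in this paper.

```python
from typing import List


def wait_k_schedule(src_len: int, tgt_len: int, k: int) -> List[int]:
    """For each target position t (1-indexed), return how many source
    tokens are visible when that token is generated: min(k + t - 1, src_len).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]


def wait_k_actions(src_len: int, tgt_len: int, k: int) -> List[str]:
    """Expand the schedule into an explicit READ/WRITE action sequence."""
    actions: List[str] = []
    read = 0  # number of source tokens consumed so far
    for visible in wait_k_schedule(src_len, tgt_len, k):
        # Read until the required number of source tokens is available.
        while read < visible:
            actions.append("READ")
            read += 1
        # Emit one target token using only the source prefix read so far.
        actions.append("WRITE")
    return actions


if __name__ == "__main__":
    print(wait_k_schedule(src_len=5, tgt_len=6, k=2))
    print(wait_k_actions(src_len=3, tgt_len=3, k=2))
```

A smaller k lowers latency but gives the decoder less source context at each step, which is the gap that an auxiliary image modality is intended to help fill.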