Texts set to music are multimodal: verbal, auditory, and in some cases visual and other components interact in complex ways, and this interaction determines how the text functions as a unified whole. The translation of multimodal texts, and of opera librettos in particular, therefore places special demands on the translator: even knowledge of the subject area, terminology, and style of the source text is not always sufficient to achieve a high-quality translation. Drawing on the translation principles of Viktor Kolomiitsov (1868-1936), one of the most prominent translators of opera librettos, and on his translated texts, the article aims to determine the basic requirements for translating a libretto as a multimodal text that combines verbal and auditory components in a single textual unity. It also addresses two research questions: what is distinctive about translating multimodal texts that center on the auditory mode, and, in the case of opera librettos, what resources are available to the translator? The material for the study comprised the texts of works by L. van Beethoven and R. Wagner and their translations. A comparative analysis of Kolomiitsov’s translations and the approaches of several other translators confirmed the assumption that auditory multimodal texts are complex to translate, identified the main problems of libretto translation, and demonstrated different translation solutions for conveying the rhythmic logic of the source text, its syntactic features, alliteration, and vocabulary. The text analysis combined elements of philological and comparative analysis of translations with statistical calculations. The study revealed that the link between the verbal component and the melodic line, with its accents and rhythm, predetermines the structure of texts for vocal works.
For the translator, this means additional constraints that are difficult to overcome. Equirhythmic translation, in which the text is matched to the music, is one of the most difficult types of translation: preserving “form” (rhyme, meter, number of syllables, coincidence of accents, etc.) plays an important role and largely determines the quality of the translation of an auditory multimodal text.