The worldwide spread of remote and online learning systems at all educational levels places new requirements on existing systems and creates a need to extend their functionality. In Ukraine, the energy infrastructure currently operates unstably because of frequent hostile shelling, so residents often cannot join online classes on time, listen to lectures in full, or take full part in conferences and master classes. This creates a need to make educational materials available at a convenient time and in a form that is easy to understand and absorb. A lecture recording provides audio files that are meant for listening but are not suitable for reproduction in print. Extending existing digital educational platforms with the ability to generate an annotation (summary, abstract) of a lecture and to present it as text-and-graphic material that course students can use on paper is therefore an urgent task, and it can improve the quality assessment of a remote educational resource in terms of its content and methodology. The aim of the study is to create a generalized hybrid model for automatic annotation of a speaker's speech that recognizes the speech, converts it to text and, at the final stage, summarizes that text, keeping only the important, meaningful part of the lecture. This aim was achieved by building a generalized hybrid model for automatic annotation of input audio data that takes into account the effectiveness and features of existing methods for automatically annotating text obtained from speech-to-text conversion.
The novelty of this study lies in the use of marker words at the text summarization stage, and in the comparison of data processing efficiency at the different stages of the model on different hardware. Computational experiments on graphics processing units with the Turing architecture showed that when the volume of input data increases almost 30-fold, the processing time increases proportionally, but the more powerful NVIDIA Tesla T4 gives a speedup of more than 2.5 times over the NVIDIA GeForce GTX 1650 Mobile for both English and Ukrainian. For texts in Ukrainian, the text compression obtained (measured from the word count of the input text relative to the word count of the resulting annotation) is 89.7%, and for texts in English it is 94.15%. The proposed use of marker words improved the internal logical coherence of the input information, but it obliges speakers to use predefined marker words so that the structure of the generated annotation is preserved.
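The two kinds of figures reported above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the sketch makes two assumptions: the compression percentage is the share of words removed, that is, 100 * (1 - annotation words / input words), and the speedup is the slower running time divided by the faster one. The function names and sample values are hypothetical and are not taken from the study.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a simplifying assumption)."""
    return len(text.split())


def compression_percent(source_text: str, annotation: str) -> float:
    """Share of words removed by summarization, in percent.

    Assumed definition: 100 * (1 - annotation_words / source_words).
    """
    n_source = word_count(source_text)
    if n_source == 0:
        raise ValueError("source text must contain at least one word")
    return 100.0 * (1.0 - word_count(annotation) / n_source)


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Ratio of the slower running time to the faster one (assumed definition)."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return slow_seconds / fast_seconds


# Hypothetical data: a 1000-word transcript reduced to a 103-word annotation.
transcript = " ".join(["word"] * 1000)
summary = " ".join(["word"] * 103)
print(f"compression: {compression_percent(transcript, summary):.2f}%")
print(f"speedup: {speedup(10.0, 4.0):.2f}x")
```

Under this reading, a compression value close to 90% means that the annotation keeps roughly one word in ten of the recognized text.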