In recent years, speech has become increasingly popular as an input modality in modern human-robot interfaces, and users highly appreciate such a natural form of communication. As the technology matures, such "native" human-robot interfaces can be expected to develop further in the near future. In this paper, we propose an architecture and develop a software-hardware complex for automatic speech recognition with small and medium-sized vocabularies, intended for use in robots. One distinctive feature of the developed complex is an audiovisual speech synchronization module, which makes it possible (1) to detect a speech signal in the audio data and (2) to take into account the natural asynchrony between acoustic and visual speech. On this basis, it is possible (3) to synchronize the speech segments of the audio and video streams in time. Another distinctive feature is a modality combining module, which makes it possible (1) to combine informative data from the audio and video signals and (2) to adjust the weight of each modality depending on the signal-to-noise ratio (SNR), so that high recognition accuracy can be maintained even in acoustically noisy conditions.
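To illustrate the idea of SNR-dependent modality weighting, the following is a minimal sketch and not the implementation used in the developed complex. It assumes that each modality yields a log-likelihood per vocabulary word, that the two weights sum to one, and that the audio weight follows a logistic function of the SNR in decibels. The function names, the logistic mapping, and the parameters `midpoint_db` and `slope` are hypothetical and chosen only for illustration.

```python
import math


def audio_weight(snr_db, midpoint_db=10.0, slope=0.3):
    """Map an SNR estimate (dB) to an audio weight in (0, 1).

    The logistic shape and both parameters are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse_scores(audio_logp, video_logp, snr_db):
    """Combine per-word log-likelihoods as a weighted sum of both modalities."""
    w_audio = audio_weight(snr_db)
    w_video = 1.0 - w_audio  # weights are assumed to sum to one
    return {
        word: w_audio * audio_logp[word] + w_video * video_logp[word]
        for word in audio_logp
    }


def recognize(audio_logp, video_logp, snr_db):
    """Return the vocabulary word with the highest fused score."""
    fused = fuse_scores(audio_logp, video_logp, snr_db)
    return max(fused, key=fused.get)


# Made-up scores for a two-word vocabulary: audio favors "go", video favors "stop".
audio_logp = {"stop": -2.0, "go": -1.0}
video_logp = {"stop": -0.5, "go": -3.0}

print(recognize(audio_logp, video_logp, snr_db=30.0))   # clean audio: "go"
print(recognize(audio_logp, video_logp, snr_db=-10.0))  # noisy audio: "stop"
```

In this sketch the decision follows the audio stream when the SNR is high and shifts toward the video stream as the acoustic noise grows, which is the behavior the modality combining module is meant to provide.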