Performance vs. hardware requirements in state-of-the-art automatic speech recognition

Georgescu, Lucian Puiu; Pappalardo, Alessandro; Cucu, Horia; Blott, Michaela

doi:10.1186/s13636-021-00217-4

Cited by 19 publications

(10 citation statements)

References 57 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The number of audio datasets available in Romanian is rather low [25,26], confirming that Romanian is indeed a low-resource language. We underline that existing datasets comprising Romanian speech samples are mainly focused on automatic speech recognition, ignoring the diversity of dialects within the region.…”

Section: Introductionmentioning

confidence: 94%

Romania: university and politics in the 1980s. Case study: the University of Bucharest, the 1980s

Rotaru

2020

JECS

View full text Add to dashboard Cite

The presentation I am submitting to your attention focuses on how the University of Bucharest operated during the 1980’s, a very difficult period for Romania. As to be expected, the University of Bucharest, like the entire Romanian education system, took the full blow of communist experimental policies, reflecting quite accurately the general developments of the political system in Romania in the 1980s. The structure of Bucharest University, the curriculum, acceptance of the students via admission exams as well as the process of assignment of graduates from University of Bucharest to production units in the 1980’s, are aspects of university life which were all affected by profound changes during the period subject to the research, leading to a genuine phenomenon in the Romanian society. The way these changes were reflected in the cultural mindset and the traumas generated by them are all points of interest addressed in my presentation.

show abstract

Section: Introductionmentioning

confidence: 94%

Romania: university and politics in the 1980s. Case study: the University of Bucharest, the 1980s

Rotaru

2020

JECS

View full text Add to dashboard Cite

show abstract

“…The first experiments date back to the 1970s, but the developments in the sphere of parallel and distributed computing architectures, big data and artificial intelligence in the last years have given a great impetus to improve this technology and, thus, its reliability [10][11]. Compared to the past, the accuracy of transcription has actually improved to such a level that, on condition of a clear and clearly defined acoustic source, the accuracy level may well exceed 99%.…”

Section: Analysis Of Last Achievements and Publicationsmentioning

confidence: 99%

Research Into Speech-to-Text Tranfromation Module in the Proposed Model of a Speaker’s Automatic Speech Annotation

Barkovska

2022

ITSSI

View full text Add to dashboard Cite

The subject matter of the article is the module for converting the speaker’s speech into text in the proposed model of automatic annotation of the speaker’s speech, which has become more and more popular in Ukraine in the last two years, due to the active transition to an online form of communication and education as well as conducting workshops, interviews and discussing urgent issues. Furthermore, the users of personal educational platforms are not always able to join online meetings on time due to various reasons (one example can be a blackout), which explains the need to save the speakers’ presentations in the form of audio files. The goal of the work is to elimination of false or corrupt data in the process of converting the audio sequence into the relevant text for further semantic analysis. To achieve the goal, the following tasks were solved: a generalized model of incoming audio data summarization was proposed; the existing STT models (for turning audio data into text) were analyzed; the ability of the STT module to operate in Ukrainian was studied; STT module efficiency and timing for English and Ukrainian-based STT module operation were evaluated. The proposed model of the speaker’s speech automatic annotation has two major functional modules: speech-to-text (STT) і summarization module (SUM). For the STT module, the following models of linguistic text analysis have been researched and improved: for English it is wav2vec2-xls-r-1bz, and for Ukrainian it is Ukrainian STT model (wav2vec2-xls-r-1b-uk-with-lm.Artificial neural networks were used as a mathematical apparatus in the models under consideration. The following results were obtained: demonstrates the reduction of the word error level descriptor by almost 1.5 times, which influences the quality of word recognition from the audio and may potentially lead to obtaining higher-quality output text data. In order to estimate the timing for STT module operation, three English and Ukrainian audio recordings of various length (5s, ~60s and ~240s) were analyzed. The results demonstrated an obvious trend for accelerated obtaining of the output file through the application of the computational power of NVIDIA Tesla T4 graphic accelerator for the longest recording. Conclusions: the use of a deep neural network at the stage of noise reduction in the input file is justified, as it provides an increase in the WER metric by almost 25%, and an increase in the computing power of the graphics processor and the number of stream processors provide acceleration only for large input audio files. The following research of the author is focused on the study of the methods of the obtained text summarization module efficiency.

show abstract

“…Besides the broadcast domain, the significant increase in the ASR field has brought special interests to integrate this technology in many other applications and devices. For instance, considering speech as the most natural means of communication between humans, conversational assistants have acquired great relevance in our daily lives, both in the personal and professional environments [1]. In addition, other main sectors such as Industry, Healthcare or Automotive have already discovered the usability of speech technologies mainly with the use of voice control applications integrated in machines, medical instruments or technical devices.…”

Section: Introductionmentioning

confidence: 99%

“…These interests have triggered special challenges for the current ASR technology, mainly related to the need to optimise and reduce neural models in order to be integrated in devices with low computational power but without a noticeable loss of quality. With the aim of meeting the requirements of embedded systems, the most common optimisation techniques rely on architecture and format optimisation as well as quantisation [1].…”

Section: Introductionmentioning

confidence: 99%

Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database

et al. 2022

View full text Add to dashboard Cite

This work presents three novel speech recognition architectures evaluated on the Spanish RTVE2020 dataset, employed as the main evaluation set in the Albayzín S2T Transcription Challenge 2020. The main objective was to improve the performance of the systems previously submitted by the authors to the challenge, in which the primary system scored the second position. The novel systems are based on both DNN-HMM and E2E acoustic models, for which fully- and self-supervised learning methods were included. As a result, the new speech recognition engines clearly outperformed the performance of the initial systems from the previous best WER of 19.27 to the new best of 17.60 achieved by the DNN-HMM based system. This work therefore describes an interesting benchmark of the latest acoustic models over a highly challenging dataset, and identifies the most optimal ones depending on the expected quality, the available resources and the required latency.

show abstract

Performance vs. hardware requirements in state-of-the-art automatic speech recognition

Cited by 19 publications

References 57 publications

Romania: university and politics in the 1980s. Case study: the University of Bucharest, the 1980s

Romania: university and politics in the 1980s. Case study: the University of Bucharest, the 1980s

Research Into Speech-to-Text Tranfromation Module in the Proposed Model of a Speaker’s Automatic Speech Annotation

Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database

Contact Info

Product

Resources

About