2021 6th International Conference on Computer Science and Engineering (UBMK)
DOI: 10.1109/ubmk52708.2021.9558954
Deep Learning for Videoconferencing: A Brief Examination of Speech to Text and Speech Synthesis

Cited by 5 publications (4 citation statements). References 22 publications.
“…This study also analyses the experimental outcomes of applying two state-of-the-art pre-trained models to various test conditions and comparing the results. AI may power future video conferencing systems [22]. This study provides an overview of transcription and speech synthesis systems that are based on deep learning.…”
Section: Literature Review
confidence: 99%
“…In addition, the experimental findings of two cutting-edge pre-trained models are also scrutinized. AI may power future video conferencing systems [23].…”
Section: Literature Review
confidence: 99%
“…(16). Our knowledge of human speech processes is still incomplete, and the quality of text-to-speech is far from natural-sounding (17). Here the researcher generates and analyzes the prosodic information from the recorded Sindhi sounds using a back-propagation neural network (18).…”
Section: Related Work
confidence: 99%
“…Among them, ASR has been popularly deployed for voice-enabled information retrieval using artificial intelligence (AI) speakers and chatbots [ 5 , 6 , 7 , 8 ]. It has also been used for the transcription of social media videos [ 9 ] and video conferencing [ 10 , 11 ]. Traditionally, an ASR system is composed of three modules: a feature extractor for representation of the speech signal, an acoustic model for mapping acoustic features to linguistic units, and a language model regarding the grammar, lexicon, etc.…”
Section: Introduction
confidence: 99%
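The excerpt above describes the traditional three-module decomposition of an ASR system: a feature extractor, an acoustic model mapping features to linguistic units, and a language model over the unit sequences. The sketch below is only an illustration of that decomposition, not code from the cited paper or any specific toolkit; all names (extract_features, AcousticModel, LanguageModel, decode) are hypothetical, and the models are toy stand-ins for trained components.

```python
# Minimal sketch of the classical three-module ASR pipeline described above.
# All names are hypothetical; the "models" are toy stand-ins, not trained systems.
import numpy as np


def extract_features(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Feature extractor: frame the waveform and compute per-frame log energy
    (a stand-in for MFCC or filter-bank features)."""
    n_frames = max(1, 1 + (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)[:, None]


class AcousticModel:
    """Acoustic model: maps each feature frame to scores over linguistic units
    (here a toy 3-unit vocabulary with random weights)."""
    def __init__(self, n_units: int = 3, dim: int = 1, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(dim, n_units))

    def frame_scores(self, feats: np.ndarray) -> np.ndarray:
        logits = feats @ self.w
        return logits - logits.max(axis=1, keepdims=True)  # stabilised scores


class LanguageModel:
    """Language model: assigns a prior score to a unit sequence
    (here a toy bonus for repeated units, in place of an n-gram or neural LM)."""
    def score(self, units: list[int]) -> float:
        return sum(0.1 for a, b in zip(units, units[1:]) if a == b)


def decode(signal: np.ndarray) -> list[int]:
    """Greedy decoding combining the three modules."""
    feats = extract_features(signal)
    am, lm = AcousticModel(dim=feats.shape[1]), LanguageModel()
    units = [int(np.argmax(row)) for row in am.frame_scores(feats)]
    _ = lm.score(units)  # a real decoder would use the LM to rescore hypotheses
    return units


if __name__ == "__main__":
    audio = np.random.default_rng(1).normal(size=16000)  # 1 s of synthetic "audio"
    print(decode(audio))
```

End-to-end deep learning systems of the kind surveyed in the cited paper collapse these separately engineered modules into a single trainable network, which is the contrast the introduction above is drawing.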