Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform

Loweimi, Erfan; Yue, Zhengjun; Bell, P. J.; Renals, Steve; Cvetković, Zoran

doi:10.1109/taslp.2023.3237167

Cited by 7 publications

(4 citation statements)

References 69 publications


“…For acoustic models, a multi-stream network based on Fourier transform uses two independent branches to process the real and imaginary parts and then fuse them [54]. SHC features are used as inputs to this framework, called MS.…”

Section: Baselines (mentioning)

confidence: 99%

Robust DOA Estimation Using Multi-Scale Fusion Network with Attention Mask

Yan, Huang

2024

Applied Sciences


To overcome the limitations of traditional methods in reverberant and noisy environments, a robust multi-scale fusion neural network with attention mask is designed to improve direction-of-arrival (DOA) estimation accuracy for acoustic sources. It combines the benefits of deep learning and complex-valued operations to effectively deal with the interference of reverberation and noise in speech signals. The unique properties of complex-valued signals are exploited to fully capture inherent features and rich information is preserved in the complex field. An attention mask module is designed to generate distinct masks for selectively focusing and masking based on the input. After that, the multi-scale fusion block efficiently captures multi-scale spatial features by stacking complex-valued convolutional layers with small size kernels, and reduces the module complexity through special branching operations. Experimental results demonstrate that the model achieves significant improvements over other methods for speaker localization in reverberant and noisy environments. It provides a new solution for DOA estimation for acoustic sources in different scenarios, which has significant theoretical and practical implications.


“…Finally, to put the reported numbers in context, Table 3 demonstrates the performance of the proposed systems along with previous studies on TORGO [20,24,33].…”

Section: Multi-stream ADSR Systems (mentioning)

confidence: 99%

“…Raw signal representations such as raw waveform [14][15][16][17], raw magnitude [18], raw phase [19], raw real and imaginary parts [20] and, raw source and filter components [21] have been recently applied in acoustic modelling for typical speech. Compared with the task-blind hand-crafted features such as MFCC, the raw representations are richer information-wise.…”

Section: Introduction (mentioning)

confidence: 99%

Dysarthric Speech Recognition, Detection and Classification using Raw Phase and Magnitude Spectra

Yue, Loweimi, Cvetković

2023

Interspeech 2023


In this paper, we explore the effectiveness of deploying the raw phase and magnitude spectra for dysarthric speech recognition, detection and classification. In particular, we scrutinise the usefulness of various raw phase-based representations along with their combinations with the raw magnitude spectrum and filterbank features. We employed single and multi-stream architectures consisting of a cascade of convolutional, recurrent and fully-connected layers for acoustic modelling. Furthermore, we investigate various configurations and fusion schemes as well as their training dynamics. In addition, the accuracies of the raw phase and magnitude based systems in the detection and classification tasks are studied and discussed. We report the performance on the UASpeech and TORGO dysarthric speech databases and for different severity levels. Our best system achieved WERs of 31.2% and 9.1% for dysarthric and typical speech on TORGO and 30.2% on UASpeech, respectively.


“…Their experimental results based on the TORGO database showed that parametric CNNs outperform non-parametric CNNs, with an average WER reaching up to 35.9% tested on dysarthric speech. Loweimi, et al [103] used the raw real and imaginary parts of the Fourier transform of speech signals to investigate the multi-stream acoustic modelling approach. In their framework, the real and imaginary parts are treated as two streams of information, pre-processed via separate convolutional networks, and they combined at an optimal level of abstraction, followed by further post-processing via recurrent and fully connected layers of neural networks.…”

Section: Deep Learning Technologies of ASR for Dysarthric Speech (mentioning)

confidence: 99%

A survey of technologies for automatic Dysarthric speech recognition

Qian, Xiao

2023

J AUDIO SPEECH MUSIC PROC.


Speakers with dysarthria often struggle to accurately pronounce words and effectively communicate with others. Automatic speech recognition (ASR) is a powerful tool for extracting the content from speakers with dysarthria. However, the narrow concept of ASR typically only covers technologies that process acoustic modality signals. In this paper, we broaden the scope of this concept that the generalized concept of ASR for dysarthric speech. Our survey discussed the systems encompassed acoustic modality processing, articulatory movements processing and audio-visual modality fusion processing in the application of recognizing dysarthric speech. Contrary to previous surveys on dysarthric speech recognition, we have conducted a systematic review of the advancements in this field. In particular, we introduced state-of-the-art technologies to supplement the survey of recent research during the era of multi-modality fusion in dysarthric speech recognition. Our survey found that audio-visual fusion technologies perform better than traditional ASR technologies in the task of dysarthric speech recognition. However, training audio-visual fusion models requires more computing resources, and the available data corpus for dysarthric speech is limited. Despite these challenges, state-of-the-art technologies show promising potential for further improving the accuracy of dysarthric speech recognition in the future.
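The citation statements above describe treating the real and imaginary parts of the Fourier transform as two separate input streams. A minimal sketch of that input split, using only the Python standard library, might look like the following. The function names, the frame length, the hop size, and the absence of a window function are all assumptions made for illustration; none of them is taken from the cited paper, and the convolutional, recurrent and fully connected layers that would consume the two streams are not shown.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a 1-D sequence into overlapping frames (no padding)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def dft_real_imag(frame):
    """Return (real, imag) lists for the non-negative-frequency DFT bins."""
    n = len(frame)
    real, imag = [], []
    for k in range(n // 2 + 1):
        # Direct DFT sum; fine for a sketch, an FFT would be used in practice.
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        real.append(x_k.real)
        imag.append(x_k.imag)
    return real, imag


def two_stream_features(signal, frame_len=8, hop=4):
    """Build two parallel streams: one of real parts, one of imaginary parts.

    frame_len and hop are illustrative values, not settings from the paper.
    """
    pairs = [dft_real_imag(f) for f in frame_signal(signal, frame_len, hop)]
    real_stream = [r for r, _ in pairs]
    imag_stream = [i for _, i in pairs]
    return real_stream, imag_stream


# Toy input: a cosine with exactly one cycle per 8-sample frame.
signal = [math.cos(2 * math.pi * t / 8) for t in range(16)]
real_stream, imag_stream = two_stream_features(signal)
```

Each stream holds one row per frame and one column per frequency bin, so in a multi-stream model of the kind the statements describe, each stream could be passed to its own branch before fusion.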
