Building and evaluation of a real room impulse response dataset

Szöke, Igor; Skácel, Miroslav; Mošner, Ladislav; Paliesek, Jakub; Černocký, Jan

doi:10.1109/jstsp.2019.2917582

Cited by 91 publications

(75 citation statements)

References 40 publications

“…Therefore, these methods will not work well for low-frequency components of the sound, which inevitably introduces significant simulation error at low frequencies under 500Hz compared to accurate wave-based solvers [23]. Far-field ASR experiments that aim to compare the effectiveness of using simulated IRs against real IRs confirm that real IRs are superior in training better ASR systems [9]. The main drawback of geometric acoustic simulation is the absence of low-frequency wave effects such as diffraction [24] and room resonance [25], of which sound diffraction is a less noticeable phenomenon.…”

Section: Related Work (mentioning)

confidence: 99%

“…Based on this representation, we calculate and extract sub-band EQs for a set of recorded IRs, collected from the BUT Reverb Database (ReverbDB) [9]. The BUT ReverbDB contains 1891 IRs and 9114 background noises (both with some repetitions), recorded in 9 different real-world environments.…”

Section: Equalization Matching (mentioning)

confidence: 99%

“…Recorded IRs, simulated IRs, and compensated IRs are pre-split into disjoint subsets containing 773 and 194 IRs for generating farfield training and development sets, while another 242 recorded IRs are reserved for creating the test set. We randomly sample recorded noises from BUT ReverbDB for all augmented datasets and do not extensively experiment with the noise addition strategy, as that is not the focus of this work and has been studied in [6,9]. The final augmented dataset composition and properties are shown in Table 1.…”

Section: Far-field Speech Augmentation (mentioning)

confidence: 99%
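
The augmentation described in the statement above (reverberating clean speech with an impulse response and adding recorded noise) can be sketched roughly as follows. This is a minimal illustration, not the citing authors' pipeline: the function name `add_reverb_and_noise`, the target SNR, and the synthetic stand-in signals are all assumptions made for the example.

```python
import numpy as np

def add_reverb_and_noise(speech, rir, noise, snr_db):
    """Convolve clean speech with a room IR, then add noise at a target SNR (dB)."""
    # Reverberate by linear convolution, trimmed to the original speech length.
    reverberant = np.convolve(speech, rir)[: len(speech)]
    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0.0:
        return reverberant  # silent noise clip: nothing to mix in
    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) equals snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise

# Hypothetical stand-ins for a clean utterance, a recorded IR, and a noise clip.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
rir = rng.standard_normal(800) * np.exp(-np.arange(800) / 100.0)
noise = rng.standard_normal(5000)
far_field = add_reverb_and_noise(speech, rir, noise, snr_db=10.0)
```

In practice the three arrays would be loaded from audio files at a common sampling rate; the random signals here only keep the sketch self-contained.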

“…This makes it challenging to gather or create large enough far-field ASR training set that works for all situations. Therefore, using data augmentation to enrich existing speech data and approximate far-field speech becomes the most viable solution [6,7,8,9].…”

Section: Introduction (mentioning)

confidence: 99%

Low-Frequency Compensated Synthetic Impulse Responses For Improved Far-Field Speech Recognition

Tang, Meng, Manocha (2020)

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose a method for generating low-frequency compensated synthetic impulse responses that improve the performance of far-field speech recognition systems trained on artificially augmented datasets. We design linear-phase filters that adapt the simulated impulse responses to equalization distributions corresponding to real-world captured impulse responses. Our filtered synthetic impulse responses are then used to augment clean speech data from LibriSpeech dataset [1]. We evaluate the performance of our method on the real-world LibriSpeech test set. In practice, our low-frequency compensated synthetic dataset can reduce the word-error-rate by up to 8.8% for far-field speech recognition.
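
To make the idea of a linear-phase equalization filter concrete, here is a rough sketch that shapes a simulated impulse response with a symmetric FIR filter. It does not reproduce the paper's filter design: frequency sampling with a Hamming window is swapped in as a simple stand-in, and the function name `linear_phase_eq`, the gain points, and the tap count are assumptions chosen only for illustration.

```python
import numpy as np

def linear_phase_eq(freq_points_hz, gains_db, fs=16000, num_taps=255, n_fft=4096):
    """Design an odd-length symmetric FIR filter approximating the given gain curve."""
    assert num_taps % 2 == 1, "use an odd tap count so the filter has a centre tap"
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Interpolate the target gain (in dB) onto the FFT grid, then convert to linear.
    target_db = np.interp(freqs, freq_points_hz, gains_db)
    magnitude = 10.0 ** (target_db / 20.0)
    # A real, zero-phase spectrum gives an impulse response that is even around n = 0.
    zero_phase = np.fft.irfft(magnitude, n_fft)
    # Shift the centre to the middle tap, truncate, and window to reduce ripple.
    half = num_taps // 2
    taps = np.roll(zero_phase, half)[:num_taps] * np.hamming(num_taps)
    return taps

# Hypothetical target: +6 dB below 500 Hz, rolling off to 0 dB at 1 kHz and above.
h = linear_phase_eq([0.0, 500.0, 1000.0, 8000.0], [6.0, 6.0, 0.0, 0.0])

# Apply it to a stand-in simulated IR (exponentially decaying noise).
rng = np.random.default_rng(1)
simulated_ir = rng.standard_normal(2000) * np.exp(-np.arange(2000) / 300.0)
compensated_ir = np.convolve(simulated_ir, h)
```

Because the taps are symmetric, the filter delays every frequency by the same (num_taps - 1) / 2 samples, so it changes the spectral balance of the impulse response without smearing its timing structure.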

“…The original utterances are sampled at 48kHz, which we down-sample to 16kHz for faster processing. We used noise signals from the BUTReverbDB database [19] to contaminate clean speech utterances. The noise files consist of recordings from silent office and conference rooms.…”

Section: Experimental Set-up (mentioning)

confidence: 99%

Speech Enhancement for Noise-Robust Speech Synthesis Using Wasserstein GAN

et al. 2019

The quality of speech synthesis systems can be significantly deteriorated by the presence of background noise in the recordings. Despite the existence of speech enhancement techniques for effectively suppressing additive noise under low signal-to-noise (SNR) conditions, these techniques have been neither designed nor tested in speech synthesis tasks where background noise has relatively lower energy. In this paper, we propose a speech enhancement technique based on generative adversarial networks (GANs) which acts as a preprocessing step of speech synthesis. Motivated by the speech enhancement generative adversarial network (SEGAN) approach and recent advances in deep learning, we propose to use Wasserstein GAN (WGAN) with gradient penalty and gated activation functions to the autoencoder network of SEGAN. We studied the impact of the proposed method on a data set consisting of 28 speakers and different noise types with 3 different SNR levels. The effectiveness of the proposed method in the context of speech synthesis is demonstrated through the training of WaveNet vocoder. We compare our method against SEGAN. Both subjective and objective metrics confirm that the proposed speech enhancement approach outperforms SEGAN in terms of speech synthesis quality.

Multi-style Training for South African Call Centre Audio

Heymans, Davel, Heerden (2022)

Communications in Computer and Information Science

Mismatched data is a challenging problem for automatic speech recognition (ASR) systems. One of the most common techniques used to address mismatched data is multi-style training (MTR), a form of data augmentation that attempts to transform the training data to be more representative of the testing data; and to learn robust representations applicable to different conditions. This task can be very challenging if the test conditions are unknown. We explore the impact of different MTR styles on system performance when testing conditions are different from training conditions in the context of deep neural network hidden Markov model (DNN-HMM) ASR systems. A controlled environment is created using the LibriSpeech corpus, where we isolate the effect of different MTR styles on final system performance. We evaluate our findings on a South African call centre dataset that contains noisy, WAV49-encoded audio.
