2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2017
DOI: 10.1109/icassp.2017.7953242

Trainable frontend for robust and far-field keyword spotting

Abstract: Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic-gain-control-based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our large rerecorded no…

Cited by 103 publications (99 citation statements)
References: 11 publications
“…Definition. Per-channel energy normalization (PCEN) [70] has recently been proposed as an alternative to the logarithmic transformation of the mel-spectrogram (logmelspec), with the aim of combining dynamic range compression (DRC, also present in logmelspec) and adaptive gain control (AGC) with temporal integration. AGC is a prior stage to DRC involving a low-pass filter φ_T of support T, thus yielding…”
Section: Per-channel Energy Normalization (mentioning)
confidence: 99%
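
The statement above describes PCEN as a low-pass-filtered gain control stage followed by dynamic range compression. A minimal sketch of that pipeline is given below. It assumes one commonly cited formulation, M[t] = (1 - s) M[t-1] + s E[t] followed by (E / (eps + M)^alpha + delta)^r - delta^r, which is not spelled out in the text quoted here. The function name `pcen`, the default parameter values, and the choice to initialise the smoother with the first frame are illustrative assumptions, not the authors' implementation.

```python
def pcen(energies, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Sketch of per-channel energy normalization (assumed formulation).

    energies: list of frames, each a list of non-negative filterbank
    energies (one value per channel). Returns a list of the same shape.
    """
    if not energies:
        return []

    # Assumption: start the smoother at the first frame's energies.
    smoothed = list(energies[0])
    out = []
    for frame in energies:
        # AGC stage: first-order low-pass filter, applied per channel.
        smoothed = [(1.0 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # Normalise by the smoothed energy, then apply root compression (DRC).
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoothed)
        ])
    return out


if __name__ == "__main__":
    quiet = [[1.0, 2.0], [1.5, 0.5], [3.0, 1.0]]
    loud = [[100.0 * e for e in frame] for frame in quiet]
    # With alpha=1 and eps=0 the gain normalisation cancels a constant scale.
    print(pcen(quiet, alpha=1.0, eps=0.0))
    print(pcen(loud, alpha=1.0, eps=0.0))
```

With `alpha=1.0` and `eps=0.0`, scaling every input energy by a constant leaves the output essentially unchanged, which is the loudness robustness the abstract refers to. With `alpha` below 1 the cancellation is only partial.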
“…1. The encoder takes an acoustic feature x[t], t = 1, 2, ..., T, the 40-dimensional Mel-filter bank energies extracted from 16 kHz sampled audio signals with per-channel energy normalization [13] where t is the time frame index, as input and converts it into a hidden representation h[t]. The encoder network consists of a canonical CRNN structure with convolutional and recurrent layers in sequence to capture spectral and temporal characteristics of the acoustic features.…”
Section: Abstractpotting System Description (mentioning)
confidence: 99%
“…Yet, in recent years, the systematic use of machine learning methods has progressively reduced the need for domain-specific knowledge in several other aspects of auditory perception, including mel-frequency spectrum [11] and adaptive gain control [12]. It remains to be known whether octave equivalence can, in turn, be discovered by a machine learning algorithm, instead of being engineered ad hoc.…”
Section: Introduction (mentioning)
confidence: 99%