ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053898

Depthwise-STFT Based Separable Convolutional Neural Networks

Abstract: In this paper, we propose a new convolutional layer called Depthwise-STFT Separable layer that can serve as an alternative to the standard depthwise separable convolutional layer. The construction of the proposed layer is inspired by the fact that the Fourier coefficients can accurately represent important features such as edges in an image. It utilizes the Fourier coefficients computed (channelwise) in the 2D local neighborhood (e.g., 3 × 3) of each position of the input map to obtain the feature maps. The Fo…
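The abstract describes a two-stage construction: a fixed, channelwise Fourier analysis over a small local window, followed by learned mixing of the resulting coefficient maps. Below is a minimal PyTorch sketch of that idea. Since the abstract is truncated, the window size default, the particular frequency pairs, and the use of a learnable pointwise (1 × 1) convolution for the mixing stage are assumptions, not details confirmed by the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSTFTSeparable(nn.Module):
    """Sketch of a Depthwise-STFT separable layer: fixed 2D DFT filters
    applied depthwise (channelwise), then a learnable 1x1 mixing conv.
    The frequency pairs `freqs` are an illustrative choice."""
    def __init__(self, in_ch, out_ch, k=3, freqs=((0, 1), (1, 0), (1, 1))):
        super().__init__()
        # Build fixed (non-learnable) 2D DFT basis filters over a k x k window.
        ys, xs = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        bases = []
        for (u, v) in freqs:
            ang = -2 * math.pi * (u * ys + v * xs) / k
            bases.append(torch.cos(ang))  # real part of the Fourier coefficient
            bases.append(torch.sin(ang))  # imaginary part
        basis = torch.stack(bases)                       # (2F, k, k)
        # Replicate the basis for every input channel -> depthwise filter bank.
        weight = basis.repeat(in_ch, 1, 1).unsqueeze(1)  # (in_ch*2F, 1, k, k)
        self.register_buffer("dw_weight", weight)
        self.in_ch, self.k, self.nf = in_ch, k, 2 * len(freqs)
        # Learnable pointwise conv mixes the channelwise Fourier features.
        self.pw = nn.Conv2d(in_ch * self.nf, out_ch, kernel_size=1)

    def forward(self, x):
        # Channelwise local Fourier coefficients via a fixed depthwise conv.
        f = F.conv2d(x, self.dw_weight, padding=self.k // 2, groups=self.in_ch)
        return self.pw(f)

# Usage: drop-in replacement for a depthwise separable conv block.
layer = DepthwiseSTFTSeparable(in_ch=32, out_ch=64)
y = layer(torch.randn(1, 32, 28, 28))  # -> (1, 64, 28, 28)
```

Note the parameter saving this structure targets: the depthwise stage carries no learnable weights at all, so only the pointwise mixing convolution is trained.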

Cited by 7 publications (3 citation statements)
References 15 publications
“…Note that the number of frequency variables is reduced accordingly. For example, ST-STFT with K = 13 in Figure 3 uses four unique frequency variables in the spatial dimensions [63].…”
Section: Variations of STFT Blocks (mentioning)
confidence: 99%
“…Most deep learning architectures for speech enhancement are formulated in the full-band time-frequency (T-F) representation of the speech mixture (Tan and Wang 2021; Hu et al. 2020; Zhao and Wang 2020). Using the short-time Fourier transform (STFT), state-of-the-art methods estimate the spectrogram of the desired speech signal from the mixture spectrogram (Kumawat and Raman 2020; Pandey and Wang 2020). However, it has been confirmed that background noise is distributed uniformly across the full band, whereas human speech occupies the lower frequency band (Li, Sun, and Naqvi 2021).…”
Section: Introduction (mentioning)
confidence: 99%
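To make the full-band T-F formulation in this statement concrete, here is a minimal spectrogram-masking sketch in PyTorch. The `model` mask estimator, FFT size, and hop length are placeholders for illustration, not details of any of the cited systems.

```python
import torch

def enhance(mixture, model, n_fft=512, hop=128):
    """Estimate clean speech from a noisy mixture by full-band T-F masking."""
    window = torch.hann_window(n_fft)
    # Mixture spectrogram via the short-time Fourier transform.
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    # A network (placeholder here) predicts a mask from the magnitude.
    mask = model(spec.abs())
    # Apply the mask across the full band and invert back to the waveform.
    return torch.istft(mask * spec, n_fft, hop_length=hop, window=window)
```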
“…They then updated their model to QuartzNet [1], replacing traditional 1D-convolution layers with 1D time-channel separable convolutional layers. Time-depth separable convolutions [8,9,10] are designed to reduce the number of parameters in traditional convolutions while keeping the receptive field large. The original QuartzNet model has K × C + C² parameters, where K is the kernel size and C is the channel dimension.…”
Section: Introduction (mentioning)
confidence: 99%
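The K × C + C² count above follows directly from the block's structure: a depthwise convolution over time contributes K × C weights and the pointwise channel-mixing convolution contributes C², assuming biases are disabled. A short PyTorch sketch verifies this; the specific K and C values are illustrative, not QuartzNet's actual configuration.

```python
import torch.nn as nn

def time_channel_separable(C, K):
    """1D time-channel separable convolution: depthwise over time (K*C
    weights), then pointwise across channels (C*C weights)."""
    return nn.Sequential(
        nn.Conv1d(C, C, kernel_size=K, padding=K // 2, groups=C, bias=False),
        nn.Conv1d(C, C, kernel_size=1, bias=False),
    )

block = time_channel_separable(C=256, K=33)
n_params = sum(p.numel() for p in block.parameters())
assert n_params == 33 * 256 + 256 ** 2  # K*C + C^2
```

By contrast, a standard Conv1d(C, C, K) has K × C² parameters, so the separable form is roughly K times cheaper when C dominates K.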