2023
DOI: 10.1109/lsp.2023.3244428

DeFT-AN: Dense Frequency-Time Attentive Network for Multichannel Speech Enhancement

Abstract: In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they face challenges in capturing local relations, reducing the high computational complexity, and lowering memory usage. To address these limitations, we introduce subgroup processing in our model, combining subgroups of locally emphasized features with other subgroups containing original features. The subgr…
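As a rough illustration of the subgroup processing described in the abstract (subgroups of locally emphasized features combined with subgroups carrying the original features), the sketch below splits the channel dimension into two subgroups, applies a depthwise convolution to one as a stand-in for local emphasis, and concatenates the result with the untouched subgroup. The even split, the depthwise convolution, and the class name SubgroupLocalEmphasis are assumptions for illustration, not the published DeFTAN-II design.

import torch
import torch.nn as nn

class SubgroupLocalEmphasis(nn.Module):
    """Illustrative (assumed) subgroup-processing block: one subgroup is
    locally emphasized with a depthwise convolution, the other is passed
    through unchanged, and the two are recombined along the channel axis."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        assert channels % 2 == 0, "channels must split into two equal subgroups"
        half = channels // 2
        # Depthwise convolution stands in for the "local emphasis" operation.
        self.local = nn.Conv1d(half, half, kernel_size,
                               padding=kernel_size // 2, groups=half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        emphasized, passthrough = x.chunk(2, dim=1)
        return torch.cat([self.local(emphasized), passthrough], dim=1)

block = SubgroupLocalEmphasis(channels=64)
features = torch.randn(2, 64, 100)   # (batch, channels, time)
print(block(features).shape)          # torch.Size([2, 64, 100])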

Cited by 13 publications (3 citation statements)
References: 70 publications
“…Finally, the performance of the fusion network structure proposed in this paper is compared with commonly used methods reported in the literature, as shown in Table 3. The DeFT-AN [14] multichannel speech enhancement model uses a transformer structure built on a multi-head attention mechanism, which results in high computational complexity (MAC/s). On the other hand, the FaNet [30] network structure implements a temporal convolution structure with a relatively large parameter count.…”
Section: Results
confidence: 99%
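The complexity comparison quoted above rests on two standard figures: the number of learnable parameters and the multiply-accumulate (MAC) count per second of audio. Below is a minimal sketch of how such figures are commonly derived for a PyTorch model; the toy two-layer convolutional model and the 16 kHz frame-rate assumption are placeholders, not the DeFT-AN or FaNet architectures.

import torch.nn as nn

# Toy stand-in model; not DeFT-AN or FaNet.
model = nn.Sequential(
    nn.Conv1d(4, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=3, padding=1),
)

# Parameter count: total number of elements across all learnable tensors.
n_params = sum(p.numel() for p in model.parameters())

# MACs of a dense 1-D convolution = C_in * C_out * kernel_size * output_length.
# Dividing the total by the audio duration gives a MAC/s figure like the one quoted.
frames = 16000  # assumed: one second of input at a 16 kHz frame rate
macs = 4 * 64 * 3 * frames + 64 * 1 * 3 * frames

print(f"parameters: {n_params}, MACs for {frames} output frames: {macs}")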
“…Currently, most neural network-based single-channel and multi-channel speech enhancement methods utilize end-to-end neural network structures. Based on their fundamental network structure, they can generally be categorized into convolutional neural networks (CNN) [8], [9], [10], recurrent neural networks (RNN) [11], [12], encoder-decoder structures [13], and transformer structures [14] centered around attention mechanisms [7], among others. One branch of deep neural network-based speech enhancement estimates masks for speech time-frequency units using neural networks, such as ideal binary masks [15] and ideal ratio masks [16].…”
Section: Introduction
confidence: 99%
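Because the quoted introduction refers to ideal binary masks and ideal ratio masks as training targets, a minimal sketch of how those two masks are typically computed from clean-speech and noise magnitude spectrograms follows. The NumPy-based formulation, the 0 dB local-SNR threshold, and the exponent beta = 0.5 are common choices assumed here, not details taken from references [15] and [16].

import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0, beta=0.5):
    """speech_mag, noise_mag: magnitude spectrograms with identical shapes."""
    eps = 1e-12
    # Local SNR (dB) in each time-frequency unit.
    snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    # Ideal binary mask: keep a unit only when its local SNR exceeds the threshold.
    ibm = (snr_db > lc_db).astype(np.float32)
    # Ideal ratio mask: soft ratio of speech energy to total energy, raised to beta.
    irm = (speech_mag**2 / (speech_mag**2 + noise_mag**2 + eps)) ** beta
    return ibm, irm

# Stand-in magnitude spectrograms (257 frequency bins, 100 frames).
speech = np.abs(np.random.randn(257, 100))
noise = np.abs(np.random.randn(257, 100))
ibm, irm = ideal_masks(speech, noise)
print(ibm.shape, irm.shape, float(irm.min()), float(irm.max()))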