Multi-stage temporal representation learning via global and local perspectives for real-time speech enhancement

Hoang, Ngoc Chau; Nguyen, Thi Nhat Linh; Doan, Tuan Kiet; Nguyen, Quoc Cuong

doi:10.1016/j.apacoust.2024.110067

Applied Acoustics

2024

Cited by 1 publication

References 21 publications

A Feature Integration Network for Multi-Channel Speech Enhancement

Zeng, Zhang, Wang

2024

Sensors

Multi-channel speech enhancement has become an active area of research, demonstrating excellent performance in recovering desired speech signals from noisy environments. Recent approaches have increasingly focused on leveraging spectral information from multi-channel inputs, yielding promising results. In this study, we propose a novel feature integration network that not only captures spectral information but also refines it through shifted-window-based self-attention, enhancing the quality and precision of the feature extraction. Our network consists of blocks containing a full- and sub-band LSTM module for capturing spectral information, and a global–local attention fusion module for refining this information. The full- and sub-band LSTM module integrates both full-band and sub-band information through two LSTM layers, while the global–local attention fusion module learns global and local attention in a dual-branch architecture. To further enhance the feature integration, we fuse the outputs of these branches using a spatial attention module. The model is trained to predict the complex ratio mask (CRM), thereby improving the quality of the enhanced signal. We conducted an ablation study to assess the contribution of each module, with each showing a significant impact on performance. Additionally, our model was trained on the SPA-DNS dataset using a circular microphone array and the Libri-wham dataset with a linear microphone array, achieving competitive results compared to state-of-the-art models.
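
The abstract above states that the model is trained to predict a complex ratio mask (CRM). As a rough illustration of what such a mask does, the sketch below computes the ideal CRM for a toy complex spectrogram and applies it by element-wise complex multiplication. It assumes NumPy and uses random arrays in place of real STFT data; the function names, the array shape, and the `eps` stabilizer are hypothetical choices for this example and are not taken from the cited paper, whose network estimates the mask rather than computing it from the clean signal.

```python
import numpy as np


def ideal_complex_ratio_mask(noisy, clean, eps=1e-8):
    """Ideal CRM per time-frequency bin, so that mask * noisy ~= clean.

    eps (an assumption of this sketch) avoids division by zero.
    """
    return clean * np.conj(noisy) / (np.abs(noisy) ** 2 + eps)


def apply_complex_ratio_mask(mask, noisy):
    """Apply a CRM by element-wise complex multiplication."""
    return mask * noisy


# Toy complex "spectrograms": (frequency bins, frames). Random stand-ins,
# not real speech data.
rng = np.random.default_rng(0)
shape = (4, 6)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

mask = ideal_complex_ratio_mask(noisy, clean)
enhanced = apply_complex_ratio_mask(mask, noisy)
```

Because the mask is complex-valued, it can correct both the magnitude and the phase of each bin, which a real-valued magnitude mask cannot do.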
