The advent of voice assistant technology and its integration into smart devices has enabled many useful services, such as texting and application execution. However, most of these assistive systems cannot act as a human listener does, localizing the speaker while selectively spotting meaningful keywords. Because keyword spotting (KWS) and sound source localization (SSL) are essential and must operate in real time, the memory and computational efficiency of the neural network model is crucial. In this paper, a single neural network model for KWS and SSL is proposed to overcome the limitations of running KWS and SSL sequentially, which requires more memory and a longer inference time. The proposed model uses multi-task learning to utilize the limited resources of the device efficiently. A shared encoder serves as the initial layers, extracting common features from the multichannel audio data; task-specific parallel layers then use these features for KWS and SSL. The proposed model was evaluated on a synthetic dataset with multiple speakers, and a 7-module shared encoder structure was identified as optimal in terms of KWS accuracy, direction-of-arrival (DOA) accuracy, DOA error, and latency. It achieved a KWS accuracy of 94.51%, a DOA error of 12.397°, and a DOA accuracy of 89.86%. Consequently, the proposed model requires significantly less memory owing to the shared network architecture, which reduces the inference time without compromising KWS accuracy, DOA error, or DOA accuracy.
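
The shared-encoder, two-head structure described above can be illustrated with a brief sketch. The snippet below is a minimal, hedged illustration rather than the authors' implementation: the composition of the 7-module encoder as convolutional blocks, the channel and feature dimensions, the treatment of DOA estimation as classification over angular bins, and the loss weighting are all assumptions made for illustration.

```python
# Minimal sketch (assumed architecture, not the paper's exact model) of a
# shared-encoder, multi-task network for joint KWS and SSL in PyTorch.
import torch
import torch.nn as nn

class SharedEncoderKwsSsl(nn.Module):
    def __init__(self, in_channels=4, num_keywords=12, num_doa_classes=36,
                 num_encoder_blocks=7):
        super().__init__()
        # Shared encoder: a stack of blocks extracting common features from
        # multichannel audio features (e.g., per-microphone spectrograms).
        blocks, ch = [], in_channels
        for _ in range(num_encoder_blocks):
            blocks += [nn.Conv2d(ch, 32, kernel_size=3, padding=1),
                       nn.BatchNorm2d(32),
                       nn.ReLU()]
            ch = 32
        self.encoder = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Task-specific parallel heads: keyword classification and DOA estimation
        # (modeled here as classification over angular bins).
        self.kws_head = nn.Linear(32, num_keywords)
        self.ssl_head = nn.Linear(32, num_doa_classes)

    def forward(self, x):
        # x: (batch, mic_channels, freq_bins, time_frames)
        shared = self.pool(self.encoder(x)).flatten(1)
        return self.kws_head(shared), self.ssl_head(shared)

# Multi-task training combines the per-task losses; the 0.5 weight is illustrative.
model = SharedEncoderKwsSsl()
kws_logits, doa_logits = model(torch.randn(8, 4, 40, 100))
loss = nn.CrossEntropyLoss()(kws_logits, torch.randint(0, 12, (8,))) \
     + 0.5 * nn.CrossEntropyLoss()(doa_logits, torch.randint(0, 36, (8,)))
```

Because both heads read from the same encoder output, the encoder parameters and the per-frame feature computation are shared between the two tasks, which is what saves memory and inference time compared with running two separate models sequentially.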