ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054577

Tackling Real Noisy Reverberant Meetings with All-Neural Source Separation, Counting, and Diarization System

Abstract: Automatic meeting analysis is an essential fundamental technology required to let, e.g. smart devices follow and respond to our conversations. To achieve an optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves source separation, speaker diarization and source counting problems in an optimal way (in a sense that all the 3 tasks can be jointly optimized through error back-propagation). It was shown that the method could well handle simulated clean (noiseless and a…

citations
Cited by 30 publications
(24 citation statements)
references
References 25 publications
“…Some recent neural-network-based diarization methods utilize spatial information by aggregating multi-channel features. For example, online RSAN [20] uses inter-microphone phase difference features in addition to a single-channel magnitude spectrogram. However, the number of channels is fixed due to the network architecture, making the method less flexible.…”
Section: Related Work (mentioning)
confidence: 99%
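The inter-microphone phase difference (IPD) feature named in the statement above can be illustrated with a minimal sketch. This is not the online RSAN implementation: the function names, the naive DFT, the toy two-microphone signal, and the sign convention (reference phase minus second-channel phase) are all assumptions chosen for illustration, using only the Python standard library.

```python
import cmath
import math

def dft_frame(frame):
    """Naive DFT of one real-valued frame; returns the non-negative-frequency bins."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]

def ipd_and_magnitude(ref_frame, other_frame):
    """Return (reference-channel magnitude, IPD in radians) for each frequency bin."""
    ref_spec = dft_frame(ref_frame)
    other_spec = dft_frame(other_frame)
    # Single-channel magnitude spectrum, taken from the reference microphone.
    magnitude = [abs(z) for z in ref_spec]
    # Phase of the cross-spectrum equals phase(ref) - phase(other), wrapped to [-pi, pi].
    ipd = [cmath.phase(a * b.conjugate()) for a, b in zip(ref_spec, other_spec)]
    return magnitude, ipd

# Toy two-microphone frame (hypothetical data): the same sinusoid,
# arriving one sample later at the second microphone.
N, BIN = 16, 2
mic1 = [math.cos(2 * math.pi * BIN * t / N) for t in range(N)]
mic2 = [math.cos(2 * math.pi * BIN * (t - 1) / N) for t in range(N)]
magnitude, ipd = ipd_and_magnitude(mic1, mic2)
```

In this toy case the one-sample delay shows up as a phase difference of 2π·BIN/N at the excited bin, which is the spatial cue such features carry. Because the IPD is defined per microphone pair, a network that consumes a fixed set of pairs expects a fixed number of channels, which is consistent with the limitation the statement describes.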
“…Experiments on CTS dataset show that the proposed SGSD system can help CSD achieve a good performance on overlap regions. Similar works exist in [25,26]. However, our proposed SGSD framework offers a few major differences: (1) different from the works in [25,26] which use the BLSTM based separation model, we adopt more powerful Conv-TasNet separation model.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, our proposed SGSD framework offers a few major differences: (1) different from the works in [25,26] which use the BLSTM based separation model, we adopt more powerful Conv-TasNet separation model. It avoids the assumption that the speaker masks are additive and sum to one for each time-frequency bin which is not directly applicable to diarization [16]; (2) we evaluate our methods on realistic mismatched single-channel dataset with different speaking styles from our training set, which is more challenging than handling the simulated single-channel data in [25] and multi-channel dataset with similar speaking styles to training set in [26]; and (3) due to the more challenging situation, we cannot directly use speech separation to attain the diarization results. Therefore, different from the multi-task perspective in [25,26], we emphasize the aspect of enabling speech separation to assist CSD in the proposed SGSD system.…”
Section: Introduction (mentioning)
confidence: 99%
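The mask assumption mentioned in the statement above (speaker masks that are additive and sum to one in each time-frequency bin) can be illustrated with a small sketch. It is a generic softmax-normalized masking example, not code from any of the cited systems; the function name `softmax_masks`, the score values, and the mixture magnitudes are hypothetical.

```python
import math

def softmax_masks(scores):
    """scores[s][b] is a raw score for speaker s at time-frequency bin b.

    Returns masks[s][b] in (0, 1) that sum to one over speakers for every bin.
    """
    n_speakers, n_bins = len(scores), len(scores[0])
    masks = [[0.0] * n_bins for _ in range(n_speakers)]
    for b in range(n_bins):
        column = [scores[s][b] for s in range(n_speakers)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for s in range(n_speakers):
            masks[s][b] = exps[s] / total
    return masks

# Hypothetical scores for 2 speakers over 3 time-frequency bins,
# and a hypothetical mixture magnitude for the same bins.
scores = [[2.0, 0.0, -1.0], [0.5, 0.0, 3.0]]
mixture = [1.0, 0.4, 2.5]

masks = softmax_masks(scores)
# Masked estimates: each speaker receives a share of every bin's magnitude.
estimates = [[m * x for m, x in zip(row, mixture)] for row in masks]
```

Under this constraint the per-speaker estimates always add back up to the mixture, so every bin's energy is fully distributed among the modeled speakers.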
“…For example, speech separation models are usually trained based on the signal-level criterion, but it is not necessarily optimal for ASR or speaker diarization. To mitigate such suboptimality, there has been a series of studies for a joint model that combines multiple modules such as joint speech separation and ASR [11,15–19], joint speaker identification/diarization and speech separation [20–22], or joint speech recognition and speaker diarization [12,23,24]. Recently, an end-to-end (E2E) SA-ASR model that jointly performs speaker counting, multi-talker speech recognition, and speaker identification was proposed with a promising result for simulation data [25].…”
Section: Introduction (mentioning)
confidence: 99%