One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement

Taherian, Hassan; Eskimez, Şefik Emre; Yoshioka, Takuya; Wang, Huaming; Chen, Zhuo; Huang, Xuedong

doi:10.48550/arxiv.2110.10330

Cited by 1 publication

(1 citation statement)

References 19 publications


“…The array-geometry-agnostic modeling is useful for production. In parallel to this work, we examined its impact on personalized noise reduction [33]. Further investigation in different tasks is desired.…”

Section: Discussion (mentioning)

confidence: 99%

VarArray: Array-Geometry-Agnostic Continuous Speech Separation

Yoshioka, Wang, Wang, et al. 2021

Preprint

Self Cite


Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription. This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model. The proposed model is applicable to any number of microphones without retraining while leveraging the nonlinear correlation between the input channels. The proposed method adapts different elements that were proposed before separately, including transform-average-concatenate, conformer speech separation, and inter-channel phase differences, and combines them in an efficient and cohesive way. Large-scale evaluation was performed with two real meeting transcription tasks by using a fully developed transcription system requiring no prior knowledge such as reference segmentations, which allowed us to measure the impact that the continuous speech separation system could have in realistic settings. The proposed model outperformed a previous approach to array-geometry-agnostic modeling for all of the geometry configurations considered, achieving asclite-based speaker-agnostic word error rates of 17.5% and 20.4% for the AMI development and evaluation sets, respectively, in the end-to-end setting using no ground-truth segmentations.
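The abstract names transform-average-concatenate as one of the building blocks that lets a model accept any number of microphones. The following is a minimal NumPy sketch of that general idea, not the VarArray architecture itself: the feature sizes, the random weights, the ReLU nonlinearity, and the residual connection are all assumptions made for illustration, and the function name `tac_layer` is hypothetical. It shows that one set of shared weights can process 3-channel and 7-channel inputs alike, because the only cross-channel operation is a mean over the channel axis.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT, HID = 8, 16  # hypothetical feature and hidden sizes

# Hypothetical, randomly initialised weights; the same weights are
# shared by every channel, so they do not depend on the channel count.
W_transform = rng.standard_normal((FEAT, HID)) * 0.1
W_average = rng.standard_normal((HID, HID)) * 0.1
W_concat = rng.standard_normal((2 * HID, FEAT)) * 0.1


def relu(x):
    return np.maximum(x, 0.0)


def tac_layer(x):
    """Apply one transform-average-concatenate step.

    x has shape (channels, frames, FEAT); the output has the same shape.
    """
    # Transform: the same projection is applied to each channel.
    h = relu(x @ W_transform)                      # (C, T, HID)
    # Average: pool over channels, then project the pooled features.
    avg = relu(h.mean(axis=0) @ W_average)         # (T, HID)
    avg = np.broadcast_to(avg, h.shape)            # (C, T, HID)
    # Concatenate: each channel sees its own features plus the pooled ones.
    out = relu(np.concatenate([h, avg], axis=-1) @ W_concat)  # (C, T, FEAT)
    # Residual connection (an assumption of this sketch).
    return x + out


# The same layer handles different microphone counts without any change.
x3 = rng.standard_normal((3, 5, FEAT))
x7 = rng.standard_normal((7, 5, FEAT))
y3 = tac_layer(x3)
y7 = tac_layer(x7)
```

Because the channel axis is only ever averaged, reordering the input channels reorders the outputs in the same way, which is one reason this kind of layer does not need to know the array geometry.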
