2021
DOI: 10.48550/arxiv.2106.04129
Preprint

Personalized PercepNet: Real-time, Low-complexity Target Voice Separation and Enhancement

Abstract: The presence of multiple talkers in the surrounding environment poses a difficult challenge for real-time speech communication systems considering the constraints on network size and complexity. In this paper, we present Personalized PercepNet, a real-time speech enhancement model that separates a target speaker from a noisy multi-talker mixture without compromising on complexity of the recently proposed PercepNet. To enable speaker-dependent speech enhancement, we first show how we can train a perceptually m…

Cited by 3 publications (6 citation statements)
References 25 publications

“…Meanwhile, for speech separation, i.e., separating the voice of a target speaker from multi-speaker signals, some works (Wang et al 2018;Mun et al 2020) introduce reference speeches (spoken by the same speaker as that of the target speech) in the form of global embeddings since the global embeddings of target speech and reference speech are correlated. Recently, Giri et al (Giri et al 2021) proposed to utilize the speaker identity embeddings extracted from a clean reference for both speech separation and enhancement. However, there is still no work introducing reference speeches for speech enhancement by exploring local correlations.…”
Section: Semantics Guided Speech Processing (mentioning)
confidence: 99%
“…Third, we give results by utilizing the reference in the way of global embedding, similar to (Giri et al 2021;Wang et al 2018). Specifically, we extract the speaker's identity features from the reference speech with a pretrained speaker encoder (Jia et al 2018), and then embed the global vector into our encoder features by concatenating along the feature dimension.…”
Section: Reference Evaluation (mentioning)
confidence: 99%
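
The global-embedding conditioning described in the statement above can be illustrated with a minimal sketch. It assumes encoder features of shape (frames, feature_dim) and a single fixed-length speaker vector; the function name `concat_global_embedding` and the dimensions 64 and 256 are hypothetical and are not taken from the cited papers. The pretrained speaker encoder itself is not modeled here, so a constant vector stands in for its output.

```python
import numpy as np

def concat_global_embedding(encoder_features, speaker_embedding):
    """Append one global speaker vector to every frame of the encoder features.

    encoder_features: array of shape (num_frames, feature_dim)
    speaker_embedding: array of shape (embedding_dim,)
    Returns an array of shape (num_frames, feature_dim + embedding_dim).
    """
    num_frames = encoder_features.shape[0]
    # Repeat the same (global) embedding for each time frame.
    tiled = np.broadcast_to(speaker_embedding, (num_frames, speaker_embedding.shape[0]))
    # Concatenate along the feature dimension (last axis).
    return np.concatenate([encoder_features, tiled], axis=-1)

# Hypothetical sizes: 100 frames, 64 encoder features, 256-dim speaker vector.
features = np.zeros((100, 64), dtype=np.float32)
embedding = np.ones(256, dtype=np.float32)  # placeholder for a speaker-encoder output
conditioned = concat_global_embedding(features, embedding)
print(conditioned.shape)  # (100, 320)
```

Because the same vector is attached to every frame, this kind of conditioning carries only utterance-level speaker identity and no frame-level (local) information from the reference.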
“…The speaker embeddings have also been employed for SE. A perceptually motivated PSE model with low complexity was proposed in [2]. [1] introduced two real-time PSE models and tested with reverberant target speech corrupted by both noise and interfering speech.…”
Section: Related Work (mentioning)
confidence: 99%
“…Most modern telecommunication services are equipped with a causal/real-time speech enhancement (SE) front-end to deliver high-quality speech audio in noisy environments. Recently, "personalized" SE methods are emerging in the research field by utilizing an enrollment utterance of a target speaker as additional information to not only suppress the ambient noise and reverberation but also remove interfering speech [1,2]. With the ability to handle overlapped speech, personalized speech enhancement (PSE) models significantly improve the perceptual speech quality and the performance of downstream tasks such as automatic speech recognition (ASR) [1].…”
Section: Introduction (mentioning)
confidence: 99%