ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414998

Speaker Activity Driven Neural Speech Extraction

Abstract: Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet)…
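
As a rough illustration of the approach summarized above, the sketch below conditions a single-channel mask estimator on a speaker clue pooled over the frames where the target speaker is marked active. This is a minimal PyTorch sketch under assumed shapes and layer sizes, not the authors' ADEnet implementation; the class name and hyperparameters are hypothetical.

# Hypothetical sketch of an activity-driven extractor (not the authors' code):
# a speaker clue is pooled from the mixture encoding over frames where the
# target speaker is marked active, and that clue conditions a mask estimator.
import torch
import torch.nn as nn

class ActivityDrivenExtractor(nn.Module):
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_bins, hidden, batch_first=True, bidirectional=True)
        self.mask_net = nn.Sequential(
            nn.Linear(4 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_bins),
            nn.Sigmoid(),
        )

    def forward(self, mixture_mag, activity):
        # mixture_mag: (batch, frames, bins) magnitude spectrogram of the mixture
        # activity:    (batch, frames) binary target-speaker activity (1 = active)
        feats, _ = self.encoder(mixture_mag)                    # (B, T, 2H)
        w = activity.unsqueeze(-1)                              # (B, T, 1)
        # average frame features over active frames -> speaker clue embedding
        clue = (feats * w).sum(1) / w.sum(1).clamp(min=1.0)     # (B, 2H)
        clue = clue.unsqueeze(1).expand(-1, feats.size(1), -1)  # broadcast over time
        mask = self.mask_net(torch.cat([feats, clue], dim=-1))  # (B, T, bins)
        return mask * mixture_mag                               # estimated target magnitude

# toy usage with random tensors
model = ActivityDrivenExtractor()
mix = torch.rand(2, 100, 257)
act = (torch.rand(2, 100) > 0.5).float()
print(model(mix, act).shape)  # torch.Size([2, 100, 257])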

Cited by 21 publications (18 citation statements)
References 28 publications
“…In many scenarios, it may not be necessary to reconstruct all speakers from the mixture, instead, it suffices to extract a single target speaker. This task has been given numerous names in the literature, among which are target speaker extraction [14], [15], informed speaker extraction [16], or simply speaker extraction (SE) [17], [18]. In contrast to SS, SE systems do not suffer from the permutation ambiguity since there exists only one single output.…”
Section: Introduction (mentioning)
confidence: 99%
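
To make the permutation point in the passage above concrete, the toy sketch below contrasts a permutation-invariant separation loss, which searches over output-to-source assignments, with the direct loss of a single-output extractor. The negative-SNR objective and all function names are illustrative assumptions, not taken from the cited papers.

# Toy contrast between permutation-invariant training (PIT) for separation
# and a direct loss for single-output extraction.
from itertools import permutations
import torch

def neg_snr(est, ref):
    # negative signal-to-noise ratio in dB (lower is better)
    return -10 * torch.log10((ref ** 2).sum(-1) / ((ref - est) ** 2).sum(-1) + 1e-8)

def pit_loss(est_sources, ref_sources):
    # est_sources, ref_sources: (num_sources, samples); try every assignment
    losses = []
    for perm in permutations(range(ref_sources.size(0))):
        losses.append(torch.stack([neg_snr(est_sources[i], ref_sources[p])
                                   for i, p in enumerate(perm)]).mean())
    return torch.stack(losses).min()          # keep the best-matching permutation

def extraction_loss(est_target, ref_target):
    return neg_snr(est_target, ref_target)    # single output, no permutation search

ests, refs = torch.randn(2, 16000), torch.randn(2, 16000)
print(pit_loss(ests, refs), extraction_loss(ests[0], refs[0]))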
“…Other methods have exploited multi-modal information, for example by utilizing both visual features of the target speaker as well as an enrolment utterance [29], [30]. Finally, brain signals [31] and speaker activity [15] have also been utilized as auxiliary signals for SE.…”
Section: Introduction (mentioning)
confidence: 99%
“…The full separation attempts to estimate all sources in the mixture, which usually requires the knowledge of the number of sources. This arguably limits the practicality/flexibility of the separation (see the discussion in [18]- [20]). To alleviate, the recovery can be focused exclusively on the SOI, which is referred to as target speech extraction.…”
Section: Introduction (mentioning)
confidence: 99%
“…Once we obtain the semantic representations of concepts, we can identify the speech segments in the mixture related to them. Finally, we extract the target speech using the acoustic characteristics estimated from the identified segments in a manner similar to the activity driven extraction network (ADEnet) [8], which exploits speaker activity information for target speech extraction.…”
Section: Introduction (mentioning)
confidence: 99%