Interspeech 2020
DOI: 10.21437/interspeech.2020-2558

Hider-Finder-Combiner: An Adversarial Architecture for General Speech Signal Modification

Abstract: We introduce a prototype system for modifying an arbitrary parameter of a speech signal. Unlike signal processing approaches that require dedicated methods for different parameters, our system can, in principle, modify any control parameter that the signal can be annotated with. Our system comprises three neural networks. The 'hider' removes all information related to the control parameter, outputting a hidden embedding. The 'finder' is an adversary used to train the 'hider', attempting to detect the value of …

Citations: cited by 5 publications (4 citation statements)
References: 14 publications
“…For deep latent representation learning methods, the challenge is to relate the learned representation to interpretable speech attributes. In Qian et al. (2020) and Webber et al. (2020), this interpretability is enforced by the design of the model. Qian et al. (2020) proposed to use three independent encoder networks to decompose a speech signal into f_0, timbre and rhythm latent representations.…”
Section: Related Work
mentioning
confidence: 99%
“…Qian et al. (2020) proposed to use three independent encoder networks to decompose a speech signal into f_0, timbre and rhythm latent representations. Webber et al. (2020) focused on controlling source-filter parameters in speech signals, where the ability to control a given parameter (e.g., f_0) is enforced explicitly using labeled data and adversarial learning. In this approach, each parameter to be controlled requires a dedicated training of the model.…”
Section: Related Work
mentioning
confidence: 99%
“…In the case of neural vocoders, the synthesis quality deteriorates when the input f_o is not included in the range of the training data. Several approaches have been proposed to solve this problem [33], [34], [35], [36], [37], [38], [39]. In contrast to AR models [33], [34], [35], non-AR models [36], [37], [38], [39] can realize real-time inference.…”
mentioning
confidence: 99%
“…Several approaches have been proposed to solve this problem [33], [34], [35], [36], [37], [38], [39]. In contrast to AR models [33], [34], [35], non-AR models [36], [37], [38], [39] can realize real-time inference. The neural source filter [36] introduces nonlinear filtering and dilated convolutional layers for parametrically generated source excitation signals corresponding to f_o by source-filter modeling [40].…”
mentioning
confidence: 99%