2023

DOI: 10.1109/jproc.2023.3250266


An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era

Andreas Triantafyllopoulos, Björn Schuller, et al.


Cited by 20 publications

(6 citation statements)

References 149 publications

Citation types: 0 supporting, 6 mentioning, 0 contrasting

“…More importantly, to improve the naturalness of machine communication, the generation of emotionally expressive speech is required. While speech generation technologies have been making significant progress [4,5], emotionally expressive speech generation is still a challenge [6,7]. Speech emotion conversion (SEC) is a technique that aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical and the speaker information [6,8].…”

Section: Introduction (mentioning)

confidence: 99%

“…While speech generation technologies have been making significant progress [4,5], emotionally expressive speech generation is still a challenge [6,7]. Speech emotion conversion (SEC) is a technique that aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical and the speaker information [6,8]. Therefore, SEC has a crucial application in building next-generation human-machine interaction systems, aiming at equipping them with the ability to interact with social and emotional intelligence.…”

Section: Introduction (mentioning)

confidence: 99%

“…SEC datasets are primarily acted-out [7,10,15,16], as opposed to in-the-wild improvised datasets [12]. Acted-out databases make a strong assumption on the availability of parallel utterances [6], i.e., one where each source utterance also has a ground-truth utterance of a target emotion, which is resource inefficient to collect [17,18] and techniques relying on them lack scalability [6].…”

Section: Introduction (mentioning)

confidence: 99%

“…In this work, we specifically focus on non-parallel data. Models trained on non-parallel data are scalable to different emotion types [6], as they are not restricted by the emotion pairs trained on. However, they are also more challenging than modeling parallel data, as the problem of disentanglement arises [6,8].…”

Section: Introduction (mentioning)

confidence: 99%

“…Models trained on non-parallel data are scalable to different emotion types [6], as they are not restricted by the emotion pairs trained on. However, they are also more challenging than modeling parallel data, as the problem of disentanglement arises [6,8]. For SEC on non-parallel data, a disentanglement method is required to decompose the input speech signal into several constituents: emotion, lexical, and speaker information, encoded in respective latent representations.…”

Section: Introduction (mentioning)

confidence: 99%


End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks

Lehmann‐Willenbrock et al. 2022

Interspeech 2022


Speech emotion conversion aims to convert the expressed emotion of a spoken utterance to a target emotion while preserving the lexical information and the speaker's identity. In this work, we specifically focus on in-the-wild emotion conversion where parallel data does not exist, and the problem of disentangling lexical, speaker, and emotion information arises. In this paper, we introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance, and subsequently uses a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion. For better representation and to achieve emotion intensity control, we specifically focus on the arousal dimension of continuous representations, as opposed to performing emotion conversion on categorical representations. We test our methodology on the large in-the-wild MSP-Podcast dataset. Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion. Results further reveal that the methodology better synthesises speech for mid-scale arousal (2 to 6) than for extreme arousal (1 and 7).


Exploring the Ethical Dimensions and Societal Consequences of Affective Computing

Mishra, Deshpande, Anna et al. 2024

The Springer Series in Applied Machine Learning


No abstract

PiCo-VITS: Leveraging Pitch Contours for Fine-Grained Emotional Speech Synthesis

Wong, Chung 2024

Lecture Notes in Computer Science


No abstract
