Emotions are an essential part of spoken communication and cannot be neglected. Existing text-to-speech systems convey the emotion behind a text poorly: they read the text in a monotone that lacks expressiveness. This paper proposes an Expressive Text-to-Speech Synthesis System (ETSSS) that takes the dominant emotion of the input text into account. ETSSS works in two parts. In the first part, the input text is assigned an emotion label; in the second, this label is used to generate expressive, prosodic speech. Emotion labeling in ETSSS is carried out with BERT, which achieves an accuracy of 94%, 90%, and 90% for disgust, amused, and anger, respectively. The emotional speech synthesis module of ETSSS achieves a mean opinion score (MOS) of 3.8 for anger, 3.5 for disgust, and 3.2 for amused.
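The two-part design described above can be illustrated with a minimal Python sketch. It only shows how an emotion label produced by the first stage is passed to the second stage; the class name `ExpressiveTTS`, the method `speak`, and both toy stand-in functions are hypothetical and are not the BERT classifier or the synthesis module of ETSSS.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# The three emotion labels reported in the abstract.
EMOTIONS = ("anger", "disgust", "amused")


@dataclass
class ExpressiveTTS:
    """Hypothetical two-stage pipeline: label the text, then synthesize."""
    classify: Callable[[str], str]            # stage 1: text -> emotion label
    synthesize: Callable[[str, str], bytes]   # stage 2: (text, label) -> audio

    def speak(self, text: str) -> Tuple[str, bytes]:
        label = self.classify(text)
        if label not in EMOTIONS:
            raise ValueError(f"unexpected emotion label: {label!r}")
        # The label from stage 1 conditions the synthesis in stage 2.
        return label, self.synthesize(text, label)


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_classifier(text: str) -> str:
    # Placeholder rule; a real system would call a trained classifier here.
    return "anger" if "!" in text else "amused"


def toy_synthesizer(text: str, label: str) -> bytes:
    # Placeholder: returns tagged bytes instead of a real waveform.
    return f"[{label}] {text}".encode("utf-8")


tts = ExpressiveTTS(classify=toy_classifier, synthesize=toy_synthesizer)
label, audio = tts.speak("Stop that right now!")
```

Keeping the two stages behind plain callables means either one could be swapped for a trained model without changing the pipeline itself.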
Introduction
Speech generation from text has been in use for the past decade, and emotions play an important role in speech. The three most commonly considered aspects of speech are intelligibility, naturalness, and expressiveness. Prosody is defined as