“…As a part of the important information conveyed by human speech, emotional expressions are directly affected by the speaker's intentions that may lead to different emotions, e.g., fear, angry, happy, sad, surprise and disgust. Therefore, how to present appropriate emotions in synthetic speech is important in building diverse audio generation systems and immersive human-computer interaction systems [12], [13], [14], [15], [16], and thus has been drawn much attention recently [17], [18], [19], [20], [21], [22].…”