Exploiting Morphological and Phonological Features to Improve Prosodic Phrasing for Mongolian Speech Synthesis

Liu, Rui; Şişman, Berrak; Bao, Feilong; Yang, Jichen; Gao, Guanglai; Li, Haizhou

doi:10.1109/taslp.2020.3040523

Cited by 28 publications

(10 citation statements)

References 62 publications

“…Speech conveys information not only through phonetic content, but also through its prosody. Speech prosody can affect syntactic and semantic interpretation of an utterance [22], [23], that is called linguistic prosody. Speech prosody is also used to display one's emotional state, that is referred to as affective prosody.…”

Section: Introduction (mentioning)

confidence: 99%

Expressive TTS Training With Frame and Style Reconstruction Loss

Liu, Şişman, Gao, et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

We propose a novel training strategy for Tacotron-based text-to-speech (TTS) system that improves the speech styling at utterance level. One of the key challenges in prosody modeling is the lack of reference that makes explicit modeling difficult. The proposed technique doesn't require prosody annotations from training data. It doesn't attempt to model prosody explicitly either, but rather encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. This study marks a departure from the style token paradigm where prosody is explicitly modeled by a bank of prosody embeddings. It adopts a combination of two objective functions: 1) frame level reconstruction loss, that is calculated between the synthesized and target spectral features; 2) utterance level style reconstruction loss, that is calculated between the deep style features of synthesized and target speech. The style reconstruction loss is formulated as a perceptual loss to ensure that utterance level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms the state-of-the-art baseline in both naturalness and expressiveness. To our best knowledge, this is the first study to incorporate utterance level perceptual quality as a loss function into Tacotron training for improved expressiveness.
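The two-part objective described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the style feature extractor, the distance measure, or how the two terms are weighted, so `style_features` (a fixed projection with mean pooling over time), the squared-error distances, and the `style_weight` parameter below are all hypothetical stand-ins chosen for illustration.

```python
import math
import random


def frame_loss(synth, target):
    """Frame-level reconstruction loss: mean squared error over all frames and bins."""
    total, count = 0.0, 0
    for s_frame, t_frame in zip(synth, target):
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


def style_features(frames, proj):
    """Hypothetical stand-in for a deep style extractor (assumption, not from the paper).

    Projects every frame, applies tanh, and averages over time so that the
    whole utterance is summarized by one fixed-size vector.
    """
    dims = len(proj[0])
    pooled = [0.0] * dims
    for frame in frames:
        for j in range(dims):
            act = sum(x * proj[i][j] for i, x in enumerate(frame))
            pooled[j] += math.tanh(act)
    return [v / len(frames) for v in pooled]


def style_loss(synth, target, proj):
    """Utterance-level style loss: squared distance between the two style vectors."""
    fs = style_features(synth, proj)
    ft = style_features(target, proj)
    return sum((a - b) ** 2 for a, b in zip(fs, ft)) / len(fs)


def total_loss(synth, target, proj, style_weight=1.0):
    """Sum of both terms; the weighting scheme is an assumption of this sketch."""
    return frame_loss(synth, target) + style_weight * style_loss(synth, target, proj)


# Toy data: 20 frames of 8 spectral bins each (sizes are illustrative only).
rng = random.Random(0)
target = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]
synth = [[x + 0.1 * rng.gauss(0.0, 1.0) for x in frame] for frame in target]
proj = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]

print(total_loss(synth, target, proj))
```

The frame term compares the two utterances bin by bin, while the style term first collapses each utterance to a single vector, so it only penalizes differences that survive that utterance-level summary.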

“…Another benefit of SAN is to function with intra-attention [14,16], which has a shorter path to model long distance context. Despite the progress [15], Transformer TTS doesn't explicitly associate input text with output utterances from syntactic point of view at sentence level, which is proven useful in speaking style and prosody modeling [17][18][19][20][21]. As a result, the rendering of utterance is adversely affected especially for long sentences.…”

Section: Introduction (mentioning)

confidence: 99%

Graphspeech: Syntax-Aware Graph Attention Network for Neural Speech Synthesis

Şişman 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways. Transformer-based TTS is one of such successful implementations. While Transformer TTS models the speech frame sequence well with a self-attention mechanism, it does not associate input text with output utterances from a syntactic point of view at sentence level. We propose a novel neural TTS model, denoted as GraphSpeech, that is formulated under graph neural network framework. GraphSpeech encodes explicitly the syntactic relation of input lexical tokens in a sentence, and incorporates such information to derive syntactically motivated character embeddings for TTS attention mechanism. Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.

“…Electronic synthetic tones bring rich new sound experience to music of various styles and themes. Electronic musical instruments differ from traditional acoustic instruments in sound rendering principle and acoustic features [14][15][16][17][18][19]. Miranda et al. [20] expounded the computer-aided means to realize the acoustic features, voice editing, and modulation of electronic sound melodies and provided a valuable reference for applying electronic sound melodies in modern music creation.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic Synthesis Technology of Music Teaching Melodies Based on Recurrent Neural Network

Zhang 2021

Scientific Programming

Computer music creation boasts broad application prospects. It generally relies on artificial intelligence (AI) and machine learning (ML) to generate the music score that matches the original mono-symbol score model or memorize/recognize the rhythms and beats of the music. However, there are very few music melody synthesis models based on artificial neural networks (ANNs). Some ANN-based models cannot adapt to the transposition invariance of original rhythm training set. To overcome the defect, this paper tries to develop an automatic synthesis technology of music teaching melodies based on recurrent neural network (RNN). Firstly, a strategy was proposed to extract the acoustic features from music melody. Next, the sequence-sequence model was adopted to synthetize general music melodies. After that, an RNN was established to synthetize music melody with singing melody, such as to find the suitable singing segments for the music melody in teaching scenario. The RNN can synthetize music melody with a short delay solely based on static acoustic features, eliminating the need for dynamic features. The proposed model was proved valid through experiments.
