Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability

Ghosh, Anubhab; Honoré, Antoine; Liu, Dong; Henter, Gustav Eje; Chatterjee, Saikat

doi:10.48550/arxiv.2107.00730

Cited by 1 publication

(1 citation statement)

References 29 publications


“…Future work includes stronger network architectures, e.g., based on transformers [46], and/or a separate post-net like in [30]. It also seems compelling to combine neural HMMs with powerful distribution families such as normalising flows, either replacing the Gaussian assumption (as done for non-neural HMMs in [47]) or as a probabilistic post-net like in [22]. This might allow the naturalness of sampled speech to surpass that of deterministic output generation.…”

Section: Discussion (mentioning)

confidence: 99%

Neural HMMs are all you need (for high-quality attention-free TTS)

Mehta, Székely, Beskow et al. 2021

Preprint

Self Cite


Neural sequence-to-sequence TTS has demonstrated significantly better output quality over classical statistical parametric speech synthesis using HMMs. However, the new paradigm is not probabilistic and the use of non-monotonic attention both increases training time and introduces "babbling" failure modes that are unacceptable in production. In this paper, we demonstrate that the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing the attention in Tacotron 2 with an autoregressive left-right no-skip hidden-Markov model defined by a neural network. This leads to an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximations. We discuss how to combine innovations from both classical and contemporary TTS for best results. The final system is smaller and simpler than Tacotron 2 and learns to align and speak with fewer iterations, whilst achieving the same naturalness prior to the post-net. Our system also allows easy control over speaking rate.
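
As a rough illustration of what "left-right no-skip" and "full sequence likelihood" mean in the abstract above, the sketch below runs the standard forward algorithm in the log domain over a chain where each state may only stay put or advance by one. It is not the authors' implementation: the function name `left_right_log_likelihood`, the fixed per-state stay probabilities, and the requirement that the path starts in the first state and ends in the last are all simplifying assumptions (in the cited paper the emission and transition probabilities come from a neural network).

```python
import numpy as np

def left_right_log_likelihood(log_emit, stay_prob):
    """Exact log-likelihood of a left-right no-skip HMM (illustrative sketch).

    log_emit  : (T, N) array, log p(x_t | state n). Assumed precomputed.
    stay_prob : (N,) array, probability of staying in state n (strictly
                between 0 and 1); the remaining mass moves to state n + 1.
    Assumption: the path starts in state 0 and ends in state N - 1.
    """
    log_emit = np.asarray(log_emit, dtype=float)
    stay_prob = np.asarray(stay_prob, dtype=float)
    T, N = log_emit.shape
    log_stay = np.log(stay_prob)
    log_move = np.log1p(-stay_prob)

    # alpha[n] = log p(x_1..x_t, state_t = n); only state 0 is reachable at t = 0.
    alpha = np.full(N, -np.inf)
    alpha[0] = log_emit[0, 0]

    for t in range(1, T):
        stay = alpha + log_stay                 # remain in the same state
        move = np.full(N, -np.inf)
        move[1:] = alpha[:-1] + log_move[:-1]   # advance by exactly one state
        alpha = log_emit[t] + np.logaddexp(stay, move)

    # Sum over all monotonic alignments that end in the final state.
    return alpha[-1]
```

Because every alignment the recursion sums over moves strictly left to right without skipping, the resulting alignment is monotonic by construction, and the returned value is the exact log-likelihood under these assumptions rather than an approximation.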
