Thilo Koehler scite author profile

Thilo Koehler

5 Publications

16 Citation Statements Received

62 Citation Statements Given

Affiliations

Meta (Israel), University of Minnesota

Publications

G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR

Koehler, Fliegen, et al. 2020

Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on nonphonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel method to train a statistical grapheme-to-grapheme (G2G) model on text-to-speech data that can rewrite an arbitrary character sequence into more phonetically consistent forms. We show that using G2G to provide alternative pronunciations during decoding reduces Word Error Rate by 3% to 11% relative over a strong graphemic baseline and bridges the gap on rare name recognition with an equivalent phonetic setup. Unlike many previously proposed methods, our method does not require any change to the acoustic model training procedure. This work reaffirms the efficacy of grapheme-based modeling and shows that specialized linguistic knowledge, when available, can be leveraged to improve graphemic ASR.

Transformer-Based Acoustic Modeling for Streaming Speech Synthesis

Wu, Xiu, Shi, et al. 2021

Improving Polyglot Speech Synthesis through Multi-task and Adversarial Learning

Fong, Wu, Agrawal, et al. 2021

Multi-Rate Attention Architecture for Fast Streamable Text-to-Speech Spectrum Modeling

Xiu, Koehler, et al. 2021

Typical high-quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate the encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high-quality speech, they often incur an O(L) increase in both latency and real-time factor (RTF) with respect to input length L. In other words, longer inputs lead to longer delays and slower synthesis speed, limiting their use in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to ground truth 4.48), low latency, and low RTF at the same time. Meanwhile, both latency and RTF of the proposed system stay constant regardless of input length, making it ideal for real-time applications.

Inverse kinematics for a multifingered hand

Koehler, Donath
