Subword and Crossword Units for CTC Acoustic Models

Zenkel, Thomas; Sanabria, Ramon; Metze, Florian; Waibel, Alex

doi:10.48550/arxiv.1712.06855

Cited by 4 publications

(9 citation statements)

References 0 publications

“…In these approaches, a single neural network is trained to recognize graphemes or even words from speech directly. Especially, the model using semantically meaningful units, such as words or sub-word (Sennrich et al, 2015), rather than graphemes have been showing promising results (Audhkhasi et al, 2017b; Li et al, 2018; Soltau et al, 2016; Zenkel et al, 2017; Palaskar and Metze, 2018; Sanabria and Metze, 2018; Rao et al, 2017; Zeyer et al, 2018).…”

Section: Introduction (mentioning)

confidence: 99%

Acoustic-to-Word Models with Conversational Context Information

Kim

Metze

2019

Proceedings of the 2019 Conference of the North

Self Cite

Conversational context information, higher-level knowledge that spans across sentences, can help to recognize a long conversation. However, existing speech recognition models are typically built at a sentence level, and thus it may not capture important conversational context information. The recent progress in end-to-end speech recognition enables integrating context with other available information (e.g., acoustic, linguistic resources) and directly recognizing words from speech. In this work, we present a direct acoustic-to-word, end-to-end speech recognition model capable of utilizing the conversational context to better process long conversations. We evaluate our proposed approach on the Switchboard conversational speech corpus and show that our system outperforms a standard end-to-end speech recognition system.

“…Our second observation is concerned with different choices of the modeling units. End-to-end systems directly map acoustic features to label sequences, which are composed of symbols like phonemes [6,7], characters [4,5,8,9,10,11], subwords [12,13] and words [14]. Phoneme based approaches need a carefully designed pronunciation lexicon to map words to phoneme sequences.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recently, the subword based model has shown impressive results in neural machine translation (NMT) [15,16] because of its ability to deal with infrequent words, like compounds, cognates as well as loanwords. For end-to-end speech recognition, there are also successful applications with subword units [12,13].…”

Section: Introduction (mentioning)

confidence: 99%

“…In [13], both subword units and cross-word units are generated with the byte-pair encoding (BPE) [15] method, and the neural network is trained based on the CTC loss using a subword and cross-word based language model. Cross-word units are taken into the unit set in order to model liaisons in oral English conversations, such as speaking "gonna" instead of "going to".…”

Section: Introduction (mentioning)

confidence: 99%
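
The statement above describes generating subword and cross-word units with byte-pair encoding. A minimal sketch of that idea follows, assuming a toy corpus and a hypothetical `cross_word` switch: when it is on, the word-boundary marker `_` is treated as an ordinary symbol, so merges may span two words. The function names and the boundary handling are illustrative assumptions, not the procedure used in [13].

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` by one symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(lines, num_merges, cross_word=False):
    """Learn up to `num_merges` BPE merges from a list of sentences.

    cross_word=False: each word is its own sequence, ending in "_",
    so units never span a word boundary (plain subword units).
    cross_word=True: a sentence is one sequence and "_" is an ordinary
    symbol, so a unit such as "going_to" can emerge (hypothetical setup).
    """
    if cross_word:
        seqs = [list(line.replace(" ", "_")) for line in lines]
    else:
        seqs = [list(word) + ["_"] for line in lines for word in line.split()]

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # every sequence is already a single unit
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs


if __name__ == "__main__":
    corpus = ["going to"] * 3
    print(learn_bpe(corpus, 10, cross_word=False)[1][:2])
    print(learn_bpe(corpus, 10, cross_word=True)[1][:1])
```

With the switch off, the units stay inside single words; with it on, a frequent word pair can collapse into a single unit, which is the kind of unit the statement associates with spoken forms such as "gonna".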

“…Cross-word units are taken into the unit set in order to model liaisons in oral English conversations, such as speaking "gonna" instead of "going to". However, the CTC model employed in [13] performs poorly without an external language model. Using an external language model would require a predefined dictionary.…”

Section: Introduction (mentioning)

confidence: 99%

Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units

Xiao

Chu

et al. 2018

2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP)

In this paper, we present an end-to-end automatic speech recognition system, which successfully employs subword units in a hybrid CTC-Attention based system. The subword units are obtained by the byte-pair encoding (BPE) compression algorithm. Compared to using words as modeling units, using characters or subword units does not suffer from the out-of-vocabulary (OOV) problem. Furthermore, using subword units further offers a capability in modeling longer context than using characters. We evaluate different systems over the LibriSpeech 1000h dataset. The subword-based hybrid CTC-Attention system obtains 6.8% word error rate (WER) on the test clean subset without any dictionary or external language model. This represents a significant improvement (a 12.8% WER relative reduction) over the character-based hybrid CTC-Attention system.
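
The abstract reports a 6.8% WER and a 12.8% relative reduction but does not state the character-based baseline. The short sketch below shows how the two figures relate; the helper names are hypothetical, and the baseline it computes is only implied by the arithmetic, not a number quoted from the paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, rel_reduction):
    """Invert the formula above: baseline = new / (1 - relative reduction)."""
    return new_wer / (1.0 - rel_reduction)


if __name__ == "__main__":
    # 6.8% WER and 12.8% relative reduction are taken from the abstract.
    baseline = implied_baseline(6.8, 0.128)
    print(f"implied character-based WER: {baseline:.2f}%")
    print(f"check: {relative_reduction(baseline, 6.8):.3f}")
```

Under these assumptions the implied character-based WER is a little under 7.8%, so the absolute gap is about one point.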

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

Palaskar

Metze

2018

2018 IEEE Spoken Language Technology Workshop (SLT)

Self Cite

Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. While character-based models offer a natural solution to the out-of-vocabulary problem, word models can be simpler to decode and may also be able to directly recognize semantically meaningful units. We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work. In addition to these promising results, word-based models are more interpretable than character models, which have to be composed into words using a separate decoding step. We analyze the encoder hidden states and the attention behavior, and show that location-aware attention naturally represents words as a single speech-word-vector, despite spanning multiple frames in the input. We finally show that the Acoustic-to-Word model also learns to segment speech into words with a mean standard deviation of 3 frames as compared with human annotated forced-alignments for the Switchboard corpus.
