2020
DOI: 10.48550/arxiv.2006.04928
Preprint

Learning to Count Words in Fluent Speech enables Online Speech Recognition

Abstract: Sequence-to-Sequence models, in particular the Transformer, achieve state-of-the-art results in Automatic Speech Recognition. Practical usage, however, is limited to cases where full-utterance latency is acceptable. In this work we introduce Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting. We use the cumulative word sum to dynamically segment speech and enable its eager decoding into words. Experiments performed on the LRS2 and LibriSpeech datas…
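The segmentation mechanism the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: the per-frame word-count increments, the integer-boundary trigger, and the `lookahead` parameter are assumptions made for illustration only, standing in for the auxiliary counting head of the actual model.

```python
# Hypothetical sketch of dynamic segmentation via cumulative word counting:
# a per-frame word-count increment (assumed to come from an auxiliary
# prediction head) is accumulated, and whenever the running sum crosses
# the next integer, the frames seen so far close a segment that can be
# decoded eagerly, without waiting for the full utterance.

def segment_by_word_count(frame_increments, lookahead=0):
    """Group frame indices into segments, closing a segment each time
    the cumulative word count passes a new integer boundary.

    frame_increments: per-frame predicted word-count deltas in [0, 1].
    lookahead: extra future frames appended to each segment (illustrative).
    """
    segments, current = [], []
    cumulative, boundary = 0.0, 1
    for i, delta in enumerate(frame_increments):
        cumulative += delta
        current.append(i)
        if cumulative >= boundary:  # a whole new word has been observed
            end = min(i + 1 + lookahead, len(frame_increments))
            segments.append(current + list(range(i + 1, end)))
            current = []
            boundary += 1
    if current:  # trailing frames of a partial final word
        segments.append(current)
    return segments

# Ten frames whose increments sum to roughly two words yield two full
# segments plus a trailing partial one.
print(segment_by_word_count(
    [0.1, 0.3, 0.4, 0.3, 0.1, 0.2, 0.2, 0.3, 0.1, 0.1]))
```

With `lookahead > 0`, consecutive segments overlap by a few frames, mirroring the idea of giving the decoder a short window of future context.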

Cited by 1 publication (6 citation statements)
References 26 publications
“…We are interested in studying if the word count in fluent speech can be estimated with a higher accuracy from audio-visual cues than from the audio modality alone. In Sterpu et al (2021) we saw that the encoding look-ahead length does not have a major influence on either the word counting error or the character error rate. Therefore, in this experiment we limit our analysis to counting words from audio-visual representations with the offline models having infinite context available.…”
Section: Learning To Count Words In Audio-visual Speech
confidence: 89%
“…We begin this section by first reviewing the underlying approach Taris presented in Sterpu et al (2021).…”
Section: Taris
confidence: 99%