2018
DOI: 10.48550/arxiv.1807.00543
Preprint

Punctuation Prediction Model for Conversational Speech

Cited by 8 publications (11 citation statements)
References 0 publications
“…More powerful models have included conditional random fields [4], boosted hierarchical prediction [5], and punctuation as neural machine translation (NMT) [6]. However, state-of-the-art results have been achieved using convolutional neural networks (CNNs) [7], long short-term memory (LSTM) networks [8,9,10], and transformers [1,11], treating the problem as a classification task. In these latter approaches, the task is to consider which punctuation symbol should follow each token in an utterance, rather than, say, detecting just sentence boundaries in text.…”
Section: Related Work
confidence: 99%
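The classification formulation described in the excerpt above (predicting which punctuation symbol follows each token) can be sketched as a simple data-preparation step. The label set and function name here are illustrative assumptions, not taken from the cited papers:

```python
# Illustrative sketch (assumed names): turn punctuated text into per-token
# classification targets, where each token's label is the punctuation
# symbol that follows it, or "O" for none.

PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def make_targets(text):
    """Return (token, label) pairs for a punctuated utterance."""
    pairs = []
    for raw in text.split():
        punct = raw[-1] if raw[-1] in PUNCT_LABELS else None
        token = raw.rstrip(",.?")
        if token:
            pairs.append((token, PUNCT_LABELS[punct] if punct else "O"))
    return pairs

print(make_targets("well, i agree. do you?"))
# → [('well', 'COMMA'), ('i', 'O'), ('agree', 'PERIOD'), ('do', 'O'), ('you', 'QUESTION')]
```

A model trained on such pairs predicts one label per token, which is the per-token classification framing the excerpt contrasts with plain sentence-boundary detection.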
“…Various datasets have been proposed for punctuation prediction, including the Fisher corpus [7], TED talks and journalism data [12], and the IWSLT 2012 TED task [1,9,13]. We evaluate our unimodal text-based approach on the IWSLT TED Task dataset, comparing a baseline version of our approach to prior CNN, LSTM, and transformer architectures.…”
Section: Related Work
confidence: 99%
“…Some prosodic features, like fundamental frequency and energy, can be averaged across each word and used as input to the acoustic encoder. Similarly, word duration can also be used as a feature, and the work by [16] has shown minor improvements in punctuation prediction for conversational speech by employing relative word timing and word duration. However, such a mechanism does not capture the acoustic context beyond a word, and also prevents the use of frame-level acoustic features, where the average vector does not represent anything.…”
Section: Forced Alignment Fusion
confidence: 99%
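The per-word averaging of prosodic features that the excerpt describes can be sketched as follows; the feature values, alignment format, and function name are illustrative assumptions:

```python
# Illustrative sketch (assumed names/data): average frame-level prosodic
# features (e.g. f0, energy) over each word's aligned frames, yielding
# one value per word, as described in the excerpt.

def word_level_features(frames, alignments):
    """frames: per-frame feature values; alignments: (word, start, end)
    spans from a forced aligner, end exclusive. Returns per-word means."""
    feats = []
    for word, start, end in alignments:
        span = frames[start:end]
        feats.append((word, sum(span) / len(span)))
    return feats

f0 = [120.0, 122.0, 118.0, 200.0, 210.0, 190.0]   # made-up f0 track
align = [("yes", 0, 3), ("really", 3, 6)]          # made-up alignment
print(word_level_features(f0, align))
# → [('yes', 120.0), ('really', 200.0)]
```

This collapses each word's span to a single mean vector, which is exactly the limitation the excerpt notes: everything within (and beyond) the word's frames is reduced to one average.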
“…al. [16] with each layer having 128 weights in each direction. We also train another lexical-only model, a pretrained truncated BERT model [31] consisting of 6 transformer self-attention layers, each with hidden size 768.…”
Section: Model Configurations
confidence: 99%
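For reference, the two model configurations the excerpt mentions can be written out as plain config dicts; the field names (and the concatenated-output assumption for the BiLSTM) are illustrative, only the numbers come from the excerpt:

```python
# Illustrative sketch: the two configurations the excerpt compares, as
# plain dicts (field names are assumptions; numbers are from the excerpt).

acoustic_lstm = {
    "type": "BiLSTM",
    "hidden_per_direction": 128,  # "128 weights in each direction"
    "bidirectional": True,
    "output_dim": 2 * 128,        # assumes forward/backward states concatenated
}

truncated_bert = {
    "type": "truncated BERT [31]",
    "num_layers": 6,              # 6 transformer self-attention layers
    "hidden_size": 768,
}

print(acoustic_lstm["output_dim"], truncated_bert["hidden_size"])
# → 256 768
```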