Interspeech 2020
DOI: 10.21437/interspeech.2020-3074

Multimodal Semi-Supervised Learning Framework for Punctuation Prediction in Conversational Speech

Cited by 20 publications
(15 citation statements)
“…Moreover, in Reference [96], sentiment was predicted with the help of the multimodal approach. Punctuation predicted from conversational speech, using semi-supervised multimodal fusion techniques, is presented in Reference [97]. The hierarchical fusion technique was used for sentiment analysis using TAF data and social images [98][99][100].…”
Section: Discussion (mentioning)
confidence: 99%
“…We instead use a single prediction for each token, and we find that we can achieve superior performance using much smaller context windows than [1]. Finally, [17,18] apply transformers to punctuation prediction using lexical features and prosodic features which are aligned using pre-trained feature extractors and alignment networks. In contrast to [17,18], we use forced-alignment from ASR and learn acoustic features from scratch from spectrogram segments corresponding each text tokens.…”
Section: Related Work (mentioning)
confidence: 99%
“…We seek to understand the effects of a multimodal approach on punctuation prediction with varying amounts of future information. While multimodal approaches are common for punctuation prediction [10,18,16], we are the first to incorporate learned acoustic features from scratch using force-alignment from ASR rather than relying on other data to pretrain or hand-select acoustic features.…”
Section: Podcast Task (mentioning)
confidence: 99%
“…Speech signal holds some cues such as pauses and intonation patterns to predict punctuation marks [14]. Incorporation of speech cues to the text-based models is explored in [15,16] and have shown improvements in punctuation prediction. The distribution mismatch between text and conversational domains can be mitigated by retrofitting word embeddings to the target domain [17] when GloVe [18] embeddings are used in the model.…”
Section: Related Work (mentioning)
confidence: 99%