ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413432
Replacing Human Audio with Synthetic Audio for On-Device Unspoken Punctuation Prediction

Abstract: We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features. We demonstrate for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automati…

Cited by 9 publications (3 citation statements). References 17 publications.
“…Through augmentation of authentic human datasets using TTS, not only does this enhance the accuracy of self-training processes [11], but also the generalizability of ASR models by introducing prosodic and acoustic variations [12,13] into training data. Furthermore, as recent TTS models [8,9] are able to produce more natural, human-like speech, this has prompted researchers to explore synthetic speech datasets as complete substitutes for authentic human speech datasets [14,15]. Nevertheless, despite these notable developments in other research areas, no prior investigations have explored the potential impact of using synthetic data in DST.…”
Section: Related Work (mentioning)
confidence: 99%
“…We compare the performance of UniPunc with various baselines and SOTA systems, including: LSTM-T [13], Att-GRU [20], BERT [7], SAPR [8], Self-Att-Word-Speech [10] and MuSe [11]. We also compare to TTS punctuation data augmentation [19].…”
Section: Configurations and Baselines (mentioning)
confidence: 99%
“…Secondly, the Parrotron model is designed to produce a generic speech, while in our task, it is critical to retain the child's original voice. Similarly, algorithms for foreign accent conversion [6] and pronunciation prediction [7] will not meet our lack of data and the goal of maintaining the speaker voice.…”
Section: Introduction (mentioning)
confidence: 99%