ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414950

The Use of Voice Source Features for Sung Speech Recognition

Abstract: In this paper, we ask whether vocal source features (pitch, shimmer, jitter, etc.) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in sung vs. spoken voicing characteristics, including pitch range, syllable duration, vibrato, jitter and shimmer. We then use this analysis to inform speech recognition e…

Cited by 1 publication (3 citation statements)
References 26 publications
“…Test set GMM + 4grams [6] 52.95 49.50 TDNN-F + 3grams [6] 26.24 22.32 TDNN-F + 4-gram [6] 23.33 19.60 Kaldi LN + VQ + 3grams [7] No Value 22.97 Kaldi LN + VQ + 4grams [7] No Value 19.60 However, for the other SSL models, the best performance they achieved was by the transformer-based model. In previous studies, some researchers show that the objective of the SSL model has a higher impact on representation similarity than the model architecture [25].…”
Section: Models With Lyrics Wiki Lm -Wer Dev Set
Citation type: mentioning
confidence: 99%
“…This is motivated by the fact that spoken and sung speech have the same production system and that semantic information is conveyed in the same way in both speech styles [7]. However, there are several different acoustic features between sung and spoken speech, such as the pitch ranges, the syllable duration, and the existence of vibrato in singing [7].…”
Section: Introduction
Citation type: mentioning
confidence: 99%