The use of Voice Source Features for Sung Speech Recognition

Dabike, Gerardo Roa; Barker, Jon

doi:10.1109/icassp39728.2021.9414950

Cited by 1 publication

(3 citation statements)

References 26 publications


“…Test set
GMM + 4grams [6]: 52.95 (dev), 49.50 (test)
TDNN-F + 3grams [6]: 26.24 (dev), 22.32 (test)
TDNN-F + 4-gram [6]: 23.33 (dev), 19.60 (test)
Kaldi LN + VQ + 3grams [7]: No Value (dev), 22.97 (test)
Kaldi LN + VQ + 4grams [7]: No Value (dev), 19.60 (test)
However, for the other SSL models, the best performance they achieved was by the transformer-based model. In previous studies, some researchers show that the objective of the SSL model has a higher impact on representation similarity than the model architecture [25].…”

Section: Models With Lyrics Wiki LM - WER Dev Set (mentioning)

confidence: 99%

“…This is motivated by the fact that spoken and sung speech have the same production system and that semantic information is conveyed in the same way in both speech styles [7]. However, there are several different acoustic features between sung and spoken speech, such as the pitch ranges, the syllable duration, and the existence of vibrato in singing [7].…”

Section: Introduction (mentioning)

confidence: 99%

“…[7]. However, there are several different acoustic features between sung and spoken speech, such as the pitch ranges, the syllable duration, and the existence of vibrato in singing [7]. These differences make sung speech difficult to recognize.…”

Section: Introduction (mentioning)

confidence: 99%


Investigating self-supervised learning for lyrics recognition

Zhang, Garcia, He, et al. 2022

Preprint


Lyrics recognition is an important task in music processing. Although many traditional algorithms, such as the hybrid HMM-TDNN model, achieve good performance, studies on applying end-to-end models and self-supervised learning (SSL) are limited. In this paper, we first establish an end-to-end baseline for lyrics recognition and then explore the performance of SSL models. We evaluate four upstream SSL models grouped by training method (masked reconstruction, masked prediction, autoregressive reconstruction, and contrastive learning). After applying the SSL model, the best performance improved by 5.23% on the dev set and 2.4% on the test set compared with the previous state-of-the-art baseline system, even without a language model trained on a large corpus. Moreover, we study the generalization ability of the SSL features, considering that those models were not trained on music datasets.
