Keyword spotting in singing with duration-modeled HMMs

Kruspe, Anna

doi:10.1109/eusipco.2015.7362592

Cited by 6 publications

(12 citation statements)

References 19 publications


“…The second topic related to our research is the singing voice lyrics-to-audio alignment. Most of these works [11,12,13,14,15,16,17,18] used the forced alignment method accompanied by music-related techniques. Loscos et al [12] used MFCCs with additional features and also explored specific HMM topologies.…”

Section: Related Work (mentioning)

confidence: 99%

“…Iskandar et al [15] constrained the alignment by using musical note length distribution. Gong et al [16], Kruspe [17], Dzhambazov and Serra [18] all used syllable/phoneme duration extracted from the musical score and decoded the alignment path by duration-explicit HMM models. Chien et al [19] introduced an approach based on vowel likelihood models.…”

Section: Related Work (mentioning)

confidence: 99%


Singing Voice Phoneme Segmentation by Hierarchically Inferring Syllable and Phoneme Onset Positions

Gong

Serra

2018

Interspeech 2018


In this paper, we tackle the singing voice phoneme segmentation problem in the singing training scenario by using language-independent information: onset and prior coarse duration. We propose a two-step method. In the first step, we jointly calculate the syllable and phoneme onset detection functions (ODFs) using a convolutional neural network (CNN). In the second step, the syllable and phoneme boundaries and labels are inferred hierarchically by using a duration-informed hidden Markov model (HMM). To achieve the inference, we incorporate the a priori duration model as the transition probabilities and the ODFs as the emission probabilities into the HMM. The proposed method is designed in a language-independent way such that no phoneme class labels are used. For the model training and algorithm evaluation, we collect a new jingju (also known as Beijing or Peking opera) solo singing voice dataset and manually annotate the boundaries and labels at phrase, syllable and phoneme levels. The dataset is publicly available. The proposed method is compared with a baseline method based on hidden semi-Markov model (HSMM) forced alignment. The evaluation results show that the proposed method outperforms the baseline by a large margin regarding both segmentation and onset detection tasks.


“…Because of the implicity of the Markovian state occupancy, the phonetic duration distribution introduced in section 3.3 cannot be imposed. Kruspe [12] presents two duration modeling techniques for HMMs: hidden semi-Markov model (HSMM) and post-processor duration model.…”

Section: Duration Modeling (mentioning)

confidence: 99%

“…The post-processor duration model was first introduced by Juang et al [10]. It was then experimentally proved in Kruspe's paper [12] that this duration model works better than HSMMs for the keyword spotting task in English pop singing voice. The post-processor duration model uses the original HMMs Viterbi algorithm; therefore, during the decoding process no explicit occupancy duration distribution is imposed.…”

Section: Post-processor Duration Model (mentioning)

confidence: 99%
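
To make the post-processor idea in the statement above more concrete, here is a minimal sketch. It assumes the commonly described formulation in which standard Viterbi decoding runs first and the resulting path score is adjusted afterwards by a weighted sum of per-state duration log-probabilities. The Gaussian duration model, the weight `alpha`, the toy candidate paths, and all function names are illustrative assumptions; none of them are taken from Kruspe's paper or from the citing papers.

```python
import math
from itertools import groupby


def run_lengths(path):
    """Collapse a frame-level state path into (state, duration) segments."""
    return [(state, sum(1 for _ in group)) for state, group in groupby(path)]


def gaussian_log_pdf(x, mean, std):
    """Log-density of a Gaussian, used here as an assumed duration model."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def rescore(viterbi_log_lik, path, duration_params, alpha=1.0):
    """Add a weighted duration term to a path score after decoding.

    duration_params maps each state to an assumed (mean, std) in frames.
    """
    duration_term = sum(
        gaussian_log_pdf(duration, *duration_params[state])
        for state, duration in run_lengths(path)
    )
    return viterbi_log_lik + alpha * duration_term


# Hypothetical duration statistics (in frames) for two states.
duration_params = {"a": (4.0, 1.0), "b": (2.0, 1.0)}

# Two hypothetical candidate paths with their plain Viterbi log-likelihoods.
candidates = [
    (-10.0, ["a", "a", "a", "a", "b", "b"]),
    (-9.5, ["a", "b", "b", "b", "b", "b"]),
]

# Pick the candidate with the best duration-adjusted score.
best = max(candidates, key=lambda c: rescore(c[0], c[1], duration_params))
```

In this toy setup the second candidate has the higher raw Viterbi score, but its segment durations are far from the assumed means, so the first candidate wins after rescoring. Decoding itself is untouched, which matches the statement that no explicit duration distribution is imposed during decoding.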


Audio to score matching by combining phonetic and duration information

Gong

Pons

Serra

2017

Preprint


We approach the singing phrase audio to score matching problem by using phonetic and duration information, with a focus on studying the jingju a cappella singing case. We argue that, due to the existence of a basic melodic contour for each mode in jingju music, only using melodic information (such as pitch contour) will result in an ambiguous matching. This leads us to propose a matching approach based on the use of phonetic and duration information. Phonetic information is extracted with an acoustic model shaped with our data, and duration information is considered with the Hidden Markov Models (HMMs) variants we investigate. We build a model for each lyric path in our scores and we achieve the matching by ranking the posterior probabilities of the decoded most likely state sequences. Three acoustic models are investigated: (i) convolutional neural networks (CNNs), (ii) deep neural networks (DNNs) and (iii) Gaussian mixture models (GMMs). Also, two duration models are compared: (i) hidden semi-Markov model (HSMM) and (ii) post-processor duration model. Results show that CNNs perform better in our (small) audio dataset and also that HSMM outperforms the post-processor duration model.


Modeling Phone Call Durations via Switching Poisson Processes with Applications in Mental Health

Bonilla-Escribano

Ramírez

Artés‐Rodríguez

2020

2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP)


This work models phone call durations via switching Poisson point processes. This kind of process is composed of two intertwined intensity functions: one models the start of a call, whereas the other models when the call ends. Thus, the call duration is obtained from the inverse of the intensity function of finishing a call. Additionally, to model the circadian rhythm present in human behavior, a (positive) truncated Fourier series is used as the parametric form of the intensities. Finally, the maximum likelihood estimates of the intensity functions are obtained using a trust region method, and the performance is evaluated on synthetic and real data, showing good results.
