Eighth IEEE International Symposium on Multimedia (ISM'06) 2006
DOI: 10.1109/ism.2006.38

Automatic Synchronization between Lyrics and Music CD Recordings Based on Viterbi Alignment of Segregated Vocal Signals

Abstract: This paper describes a system that can automatically synchronize polyphonic musical audio signals with their corresponding lyrics. Although there are methods that can synchronize monophonic speech signals with corresponding text transcriptions by using Viterbi alignment techniques, they cannot be applied to vocals in CD recordings because accompaniment sounds often overlap with the vocals. To align lyrics with such vocals, we therefore developed three methods: a method for segregating vocals from polyphon…
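As a quick illustration of the Viterbi alignment technique the abstract refers to, here is a minimal, hypothetical sketch (not the paper's code): it aligns a left-to-right chain of phoneme states to acoustic frames by dynamic programming, assuming per-frame state log-likelihoods have already been computed by some acoustic model.

```python
# Minimal sketch of Viterbi forced alignment (hypothetical, not the paper's code).
# Given per-frame log-likelihoods for a left-to-right chain of phoneme states,
# find the monotonic frame-to-state assignment with maximum total score.
import numpy as np

def viterbi_align(log_lik):
    """log_lik: (T, S) array of log P(frame t | state s) for a left-to-right chain.
    Returns a length-T array giving the aligned state index for each frame."""
    T, S = log_lik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_lik[0, 0]          # alignment must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                score[t, s], back[t, s] = move + log_lik[t, s], s - 1
            else:
                score[t, s], back[t, s] = stay + log_lik[t, s], s
    # backtrack from the final state so states are visited in order
    path = np.empty(T, dtype=int)
    path[-1] = S - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# toy example: 10 frames aligned to 3 states
rng = np.random.default_rng(0)
print(viterbi_align(rng.normal(size=(10, 3))))
```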

Cited by 41 publications (36 citation statements) · References 10 publications

Citation statements (ordered by relevance):
“…The baseline method [7] is based on an HMM in which each phoneme is represented by three hidden states, and the observed nodes correspond to the low-level feature, which we will call the phoneme feature. Given a phoneme state, the 25 elements of the phoneme feature vector consist of 12 MFCCs, 12 ΔMFCCs, and 1 element containing the power difference (the subscript m stands for MFCC).…”
Section: Baseline Methods (mentioning, confidence: 99%)
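The 25-dimensional feature described in this statement (12 MFCCs, 12 ΔMFCCs, and a power difference) could be assembled roughly as follows. This is a hedged sketch using librosa; the frame parameters and the use of the 0th cepstral coefficient as the "power" term are assumptions, not the cited paper's exact configuration.

```python
# Hedged sketch: one way to build a 25-dimensional "phoneme feature"
# (12 MFCCs, 12 delta-MFCCs, 1 delta-power) with librosa.
import numpy as np
import librosa

def phoneme_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # coefficient 0 is energy-like
    d_mfcc = librosa.feature.delta(mfcc)
    feats = np.vstack([
        mfcc[1:13],      # 12 MFCCs (drop the 0th, energy-like coefficient)
        d_mfcc[1:13],    # 12 delta-MFCCs
        d_mfcc[0:1],     # 1 delta of the energy-like coefficient ("power difference")
    ])
    return feats.T       # shape (n_frames, 25)

sr = 16000
y = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr).astype(np.float32)  # 1 s test tone
print(phoneme_features(y, sr).shape)   # -> (n_frames, 25)
```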
“…More audio-centric approaches aimed at word-level alignment employ a hidden Markov model (HMM) and forced alignment [2], [7], [22]: Chen et al. [2] use a VAD component to restrict alignment to vocal regions. Mesaros and Virtanen's HMM [22] uses audio features based on the singing voice automatically segregated from the audio, but little attention is devoted to VAD: verse or chorus sections are manually selected.…”
(mentioning, confidence: 99%)
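The VAD idea mentioned above (restricting alignment to vocal regions) can be illustrated with a deliberately simple, hypothetical energy-threshold detector; real systems such as the one in [2] are considerably more sophisticated, and the percentile threshold here is an arbitrary assumption.

```python
# Hedged sketch: a very simple energy-based VAD used to mask frames before alignment.
import numpy as np

def energy_vad(frames, percentile=30.0):
    """frames: (n_frames, frame_len) array of windowed samples.
    Returns a boolean mask marking frames whose RMS energy exceeds a
    data-dependent threshold; alignment can then be restricted to these frames."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    threshold = np.percentile(rms, percentile)
    return rms > threshold

# toy frames with slowly increasing amplitude
frames = np.random.default_rng(1).normal(size=(100, 400)) * np.linspace(0.1, 1.0, 100)[:, None]
print(energy_vad(frames).sum(), "of 100 frames kept")
```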
“…We thus calculate the ACF using the envelopes of the filtered signals. The corresponding cross-channel correlation is calculated similarly to (7). Here, we extract envelopes by half-wave rectification and bandpass filtering [11].…”
Section: A. Front-end Processing (mentioning, confidence: 99%)
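A hedged sketch of the envelope/ACF computation described in this statement: half-wave rectification followed by filtering to obtain the envelope, then its autocorrelation. The filter order, cutoff, and the low-pass (rather than band-pass) choice are assumptions made only for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def envelope_acf(x, sr, cutoff_hz=30.0, max_lag=2000):
    """Half-wave rectify x, smooth it to get an amplitude envelope,
    and return the normalized autocorrelation (ACF) of that envelope."""
    rectified = np.maximum(x, 0.0)                       # half-wave rectification
    b, a = butter(2, cutoff_hz / (sr / 2), btype="low")  # smoothing filter (assumption)
    env = filtfilt(b, a, rectified)
    env = env - env.mean()
    acf = np.correlate(env, env, mode="full")[len(env) - 1:]
    return acf[:max_lag] / acf[0]

# toy signal: a 200 Hz tone amplitude-modulated at 4 Hz, so the envelope ACF
# shows periodic structure at the modulation period
sr = 4000
t = np.arange(sr) / sr
x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 200 * t)
print(envelope_acf(x, sr).shape)
```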
“…They use matched filtering to model pitch likelihoods and a Gaussian mixture model (GMM) to model pitch transitions [21]. In a model-based method [7], sinusoidal models are used to model harmonic partials given the detected pitch. These models are used to create smooth amplitude and phase trajectories over time, and sinusoids are generated and summed to produce an estimate of the vocal signal.…”
(mentioning, confidence: 99%)
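The sinusoidal re-synthesis described here can be sketched as follows: sum sinusoids at the harmonics of a detected F0 trajectory, with phase obtained by integrating the instantaneous frequency. The constant per-harmonic amplitudes and the linear F0 glide are simplifications of the smooth amplitude and phase trajectories the statement refers to.

```python
# Hedged sketch: harmonic re-synthesis of a vocal-like signal from an F0 track.
import numpy as np

def resynthesize_harmonics(f0, sr, n_harmonics=10, amps=None):
    """f0: per-sample fundamental-frequency trajectory in Hz (0 = unvoiced).
    Returns the sum of sinusoids at harmonics of f0."""
    if amps is None:
        amps = 1.0 / np.arange(1, n_harmonics + 1)       # gently decaying spectrum (assumption)
    phase = 2.0 * np.pi * np.cumsum(f0) / sr             # integrated instantaneous phase
    voiced = (f0 > 0).astype(float)
    out = np.zeros_like(f0, dtype=float)
    for h, a in enumerate(amps, start=1):
        out += a * np.sin(h * phase) * voiced            # h-th harmonic partial
    return out

sr = 16000
f0 = np.linspace(220.0, 260.0, sr)                       # 1 s glide, hypothetical F0 track
vocal_estimate = resynthesize_harmonics(f0, sr)
print(vocal_estimate.shape)
```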