2021
DOI: 10.1109/taslp.2021.3091817
Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation

Abstract: The goal of singing voice separation is to recover the vocals signal from music mixtures. State-of-the-art performance is achieved by deep neural networks trained in a supervised fashion. Since training data are scarce and music signals are extremely diverse, it remains challenging to achieve high separation quality across various recording and mixing conditions as well as music styles. In this paper, we investigate to which extent the separation can be improved when lyrics transcripts are used as additional i…

Cited by 17 publications (5 citation statements)
References 28 publications (74 reference statements)
“…Schulze-Forster et al. [158] adapted the attention mechanism to allow the use of weakly labeled side information by learning lyrics alignment as a byproduct. Schulze-Forster et al. [188] further considered aligned phoneme sequences from lyrics transcripts, in contrast to the word sequences in [157], as an additional input to the separation model. They proposed a sequential framework in which a phoneme alignment network is followed by a modified Open-Unmix architecture for singing voice separation that takes the aligned phoneme information as an additional input.…”
Section: B. Content-Informed Data-Driven Methods
confidence: 99%
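To make the quoted two-stage design concrete, here is a minimal, hypothetical PyTorch sketch of how frame-aligned phoneme conditioning could be wired into an Open-Unmix-style masking network. The class name, layer sizes, phoneme vocabulary size, and the concatenation scheme are illustrative assumptions, not the exact configuration of [188].

```python
import torch
import torch.nn as nn

class PhonemeInformedSeparator(nn.Module):
    """Open-Unmix-style masker conditioned on frame-aligned phonemes.
    All hyperparameters below are illustrative, not the paper's values."""

    def __init__(self, n_bins=2049, n_phonemes=40, emb_dim=32, hidden=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        # Encoder sees the spectrogram frame concatenated with the phoneme embedding.
        self.encoder = nn.Linear(n_bins + emb_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.decoder = nn.Linear(hidden, n_bins)

    def forward(self, mix_mag, phonemes):
        # mix_mag:  (batch, frames, n_bins) mixture magnitude spectrogram
        # phonemes: (batch, frames) aligned phoneme index per time frame
        x = torch.cat([mix_mag, self.phoneme_emb(phonemes)], dim=-1)
        x = torch.tanh(self.encoder(x))
        x, _ = self.lstm(x)
        mask = torch.sigmoid(self.decoder(x))   # soft vocal mask
        return mask * mix_mag                   # estimated vocals magnitude

# Smoke test with random tensors in place of real spectrograms.
model = PhonemeInformedSeparator()
mix = torch.rand(2, 100, 2049)
ph = torch.randint(0, 40, (2, 100))
print(model(mix, ph).shape)  # torch.Size([2, 100, 2049])
```

Concatenating the phoneme embedding at the input is only one plausible conditioning point; the embedding could equally be injected at a deeper layer.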
“…GPT-4, a large language model, serves as the annotator, while the Whisper speech recognition model [84] assists in audio transcription. Leveraging the MTG-Jamendo dataset of 55,000 songs in various languages, the model requires no training and is tested directly on multiple datasets, including Jamendo [172], Hansen [173], MUSDB [174], and DSing [175]. This combined approach not only transcribes lyrics in multiple languages but also reduces the Word Error Rate (WER) in English.…”
Section: Automatic Speech Recognition (ASR)
confidence: 99%
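As a concrete illustration of the transcription-plus-scoring step this pipeline implies, here is a minimal Python sketch using the openai-whisper and jiwer packages; the audio file name, model size, and reference lyric are placeholders, and this is not the cited paper's actual evaluation code.

```python
import whisper          # pip install openai-whisper
from jiwer import wer   # pip install jiwer

# Load a Whisper checkpoint; "base" is a placeholder model size.
model = whisper.load_model("base")

# Transcribe a hypothetical song excerpt; Whisper detects the language itself.
result = model.transcribe("song_excerpt.mp3")
hypothesis = result["text"]

# Score the transcript against reference lyrics via Word Error Rate.
reference = "you are the sunshine of my life"
print(f"WER: {wer(reference, hypothesis.lower()):.2f}")
```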
“…The MusDB dataset consists of 150 full-length tracks with background music: 86 tracks for the training set, 14 for validation, and 50 for the test set. MusDB totals only about 10 hours of audio owing to the limited availability of commercial music, and the lyrics are manually aligned using [22].…”
Section: Dataset
confidence: 99%
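The 86/14/50 split quoted above matches the standard split exposed by the sigsep musdb Python package; a minimal loading sketch follows, with the dataset root path as a placeholder.

```python
import musdb  # pip install musdb

# Root path is a placeholder; point it at a local MUSDB18 copy.
train = musdb.DB(root="/data/musdb18", subsets="train", split="train")
valid = musdb.DB(root="/data/musdb18", subsets="train", split="valid")
test = musdb.DB(root="/data/musdb18", subsets="test")

print(len(train.tracks), len(valid.tracks), len(test.tracks))  # expected: 86 14 50

# Each track exposes the stereo mixture and the isolated vocal stem.
track = train.tracks[0]
mixture = track.audio                      # (samples, 2) mixture waveform
vocals = track.targets["vocals"].audio     # (samples, 2) vocal reference
```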