ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747684
Genre-Conditioned Acoustic Models for Automatic Lyrics Transcription of Polyphonic Music

Abstract: Lyrics transcription of polyphonic music is challenging not only because the singing vocals are corrupted by the background music, but also because the background music and the singing style vary across music genres, such as pop, metal, and hip hop, which affects lyrics intelligibility of the song in different ways. In this work, we propose to transcribe the lyrics of polyphonic music using a novel genre-conditioned network. The proposed network adopts pre-trained model parameters, and incorporates the genre ad…

Cited by 8 publications (9 citation statements)
References 27 publications
“…3) The Sound Source localization for Robots (SSLR) [28] is recorded using the humanoid Pepper robot, where four microphones and a stereo-vision camera are mounted on the robot head. It mostly uses a loudspeaker for recording, whereas human recordings only last for 4 minutes.…”
Section: A. Existing Datasets (mentioning)
confidence: 99%
“…interaction (HRI) applications, such as speech enhancement [1] and separation [2], music information processing [3], [4]. They can be estimated via the arrival time or energy level differences between signals from two spatially separated microphones [5], [6].…”
(mentioning)
confidence: 99%
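The statement above describes estimating source delays from arrival-time differences between signals at two spatially separated microphones. A minimal sketch of that idea, using a plain cross-correlation peak to find the time difference of arrival (TDOA); this is an illustrative toy, not the method of any cited paper, and the function name and synthetic signals are assumptions:

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    """Estimate the time difference of arrival (in seconds) between two
    microphone signals from the peak of their cross-correlation."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    # Lags in 'full' mode run from -(len(sig_b)-1) to len(sig_a)-1.
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)
    return lag / fs

# Synthetic check: the same impulse arrives at the second channel
# 5 samples later than at the first.
fs = 16000
mic_1 = np.zeros(64); mic_1[10] = 1.0
mic_2 = np.zeros(64); mic_2[15] = 1.0   # delayed copy
tdoa = estimate_tdoa(mic_2, mic_1, fs)  # 5 samples / 16 kHz
```

Real systems typically use a more robust variant such as GCC-PHAT, which whitens the cross-spectrum before locating the peak so the estimate is less sensitive to reverberation.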
“…In many studies, singing extractors and singing acoustic modeling are two steps in a pipeline, where each of these modules is independently trained [17], [21], [22] and extracted vocals often suffer from distortions in the extractor. On the other hand, [15], [20] have shown promising results where the vocal extraction step is avoided and the vocals along with the accompanying music are modeled together for lyrics recognition (i.e. direct modeling).…”
Section: A. Motivation (mentioning)
confidence: 99%
“…This system performed well for the task of lyrics-to-audio alignment, but showed a high word error rate (WER) in lyrics transcription. Music-informed acoustic modelling that incorporates music genre-specific information has been proposed [15], [20]. The study in [15] suggested that lyrics acoustic models can benefit from genre knowledge of the background music, but it requires a separate genre extraction step.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, Seq2Sick (Cheng et al., 2020) generates adversarial examples that decrease the BLEU score of neural machine translation models. In addition to accuracy, inference efficiency is also highly critical for various real-time applications, e.g., speech recognition (Wang et al., 2022), machine translation (Fan et al., 2021; Zhu et al., 2020), and lyrics transcription (Gao et al., 2022a,b, 2023). Recently, NICGSlowDown and NMT-Sloth (Chen et al., 2022c,d) propose delaying the appearance of the end token to reduce the efficiency of language generative models.…”
Section: Adversarial Attack (mentioning)
confidence: 99%