Interspeech 2020
DOI: 10.21437/interspeech.2020-1806

Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music

Abstract: Detecting the singing voice in polyphonic instrumental music is critical to music information retrieval. Training a robust vocal detector requires a large dataset labeled as vocal or non-vocal at the frame level. However, frame-level labeling is time-consuming and labor-intensive, so few well-labeled datasets are available for singing-voice detection (S-VD). Hence, we propose a data augmentation method for S-VD by transfer learning. In this study, clean speech clips with voice activity endpoi…
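The augmentation described in the abstract can be sketched roughly as follows. This is a minimal illustration under the assumption that clean speech clips, whose voice-activity endpoints are already known, are mixed with instrumental music so that frame-level vocal/non-vocal labels fall out of the speech endpoints for free; the function name, sample rate, and hop size below are illustrative choices, not values taken from the paper.

```python
import numpy as np

SR = 16000          # sample rate (assumed)
FRAME_HOP = 512     # hop size in samples for frame-level labels (assumed)

def mix_speech_with_music(speech, music, vad_endpoints, snr_db=0.0):
    """Overlay a clean speech clip onto instrumental music at a given SNR.

    speech, music : 1-D float arrays at SR (music at least as long as speech)
    vad_endpoints : list of (start_sec, end_sec) voice-activity segments
                    shipped with the clean speech clip
    Returns the mixture and frame-level vocal/non-vocal labels derived
    directly from the speech endpoints, so no manual annotation is needed.
    """
    music = music[: len(speech)]

    # Scale the speech to reach the requested speech-to-music ratio.
    speech_pow = np.mean(speech ** 2) + 1e-12
    music_pow = np.mean(music ** 2) + 1e-12
    gain = np.sqrt(music_pow / speech_pow * 10 ** (snr_db / 10.0))
    mixture = gain * speech + music

    # Frame-level labels: 1 where the clean speech is active, else 0.
    n_frames = 1 + len(mixture) // FRAME_HOP
    labels = np.zeros(n_frames, dtype=np.int8)
    for start_sec, end_sec in vad_endpoints:
        start_f = int(start_sec * SR / FRAME_HOP)
        end_f = int(np.ceil(end_sec * SR / FRAME_HOP))
        labels[start_f:end_f] = 1
    return mixture, labels
```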

Cited by 13 publications (16 citation statements)
References 17 publications
“…As baselines, a common and typical bi-modal recurrent neural model [19] is used as the A-V baseline (Base-AV), and a CRNN [2] trained by transfer learning is used as the audio-based baseline (Base-A) to compare the performance of the AV-VAD from more perspectives.…”
Section: Dataset, Baseline and Experiment Setup
confidence: 99%
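The audio-based baseline mentioned above is a CRNN for frame-level vocal detection; a minimal PyTorch sketch of such a model is shown below. The layer sizes and structure are assumptions for illustration, not the configuration reported in [2]; in a transfer-learning setup the convolutional front-end would typically be initialized from a model pretrained on a related task (e.g. speech VAD) and then fine-tuned on the music data.

```python
import torch
import torch.nn as nn

class CRNNVocalDetector(nn.Module):
    """Illustrative CRNN for frame-level singing-voice / voice-activity detection.

    Input : log-mel spectrogram of shape (batch, 1, n_frames, n_mels)
    Output: per-frame vocal probability of shape (batch, n_frames)
    """

    def __init__(self, n_mels=64, rnn_hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 4)),   # pool only along frequency
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 4)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 16), rnn_hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * rnn_hidden, 1)

    def forward(self, x):
        # x: (batch, 1, n_frames, n_mels)
        h = self.conv(x)                        # (batch, C, n_frames, n_mels // 16)
        h = h.permute(0, 2, 1, 3).flatten(2)    # (batch, n_frames, C * n_mels // 16)
        h, _ = self.rnn(h)                      # (batch, n_frames, 2 * rnn_hidden)
        return torch.sigmoid(self.fc(h)).squeeze(-1)
```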
“…For evaluation metrics, event-based precision (P), recall (R), F-score (F) and error rate (ER) [21] are used. Compared with the segment-based metrics used in previous studies [22,16,2], event-based metrics are more rigorous and accurate in measuring the location of events. Higher P, R and F and a lower ER indicate better performance.…”
Section: Dataset, Baseline and Experiment Setup
confidence: 99%
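For concreteness, a simplified, onset-only version of the event-based scores referenced above could be computed as in the sketch below. The full definitions in [21] (as implemented in toolkits such as sed_eval) also include an offset condition and multi-class handling, so this is an assumption-laden illustration for a single "vocal" event class rather than the reference implementation.

```python
def event_based_scores(ref_events, est_events, collar=0.2):
    """Event-based precision, recall, F-score and error rate for one class.

    ref_events, est_events: lists of (onset_sec, offset_sec) tuples.
    An estimated event counts as correct if its onset lies within `collar`
    seconds of a not-yet-matched reference onset (simplified onset-only matching).
    """
    matched_ref = set()
    tp = 0
    for est_on, _ in est_events:
        for i, (ref_on, _) in enumerate(ref_events):
            if i not in matched_ref and abs(est_on - ref_on) <= collar:
                matched_ref.add(i)
                tp += 1
                break
    fp = len(est_events) - tp          # insertions (spurious events)
    fn = len(ref_events) - tp          # deletions (missed events)
    precision = tp / len(est_events) if est_events else 0.0
    recall = tp / len(ref_events) if ref_events else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    error_rate = (fp + fn) / len(ref_events) if ref_events else 0.0
    return precision, recall, f_score, error_rate
```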
“…To recognize speech and singing voices in these videos, voice activity detection (VAD) is a necessary preprocessing step that identifies the start and end times of human voice activity. VAD has attracted much interest due to its wide range of applications, such as speech [1,2] and music information processing [3].…”
Section: Introduction
confidence: 99%