Transfer Learning for Improving Singing-Voice Detection in Polyphonic Instrumental Music

Hou, Yuanbo; Soong, Frank K.; Luan, Jian; Li, Shengchen

doi:10.21437/interspeech.2020-1806

Cited by 13 publications

(16 citation statements)

References 17 publications

“…As baselines, a common and typical bi-modal recurrent neural model [19] is used as A-V baseline (Base-AV), and a CRNN [2] trained by transfer learning is used as audio-based baseline (Base-A) to compare the performance of the AV-VAD from more perspectives.…”

Section: Dataset Baseline and Experiments Setup (mentioning)

confidence: 99%

“…For evaluation metrics, event-based precision (P), recall (R), F-score and error rate (ER) [21] are used. Compared with segment-based metrics used in previous studies [22,16,2], event-based metrics are more rigorous and accurate to measure the location of events. Higher P, R, F and lower ER indicate a better performance.…”

Section: Dataset Baseline and Experiments Setup (mentioning)

confidence: 99%
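
The event-based metrics named in the statement above can be illustrated with a minimal sketch. It assumes a single "voice" class (so no substitutions occur), onset-only matching, and a 0.2 s onset tolerance; the function name `event_based_scores`, the `collar` parameter, and the greedy matching are illustrative assumptions, not the evaluation code used by the citing paper.

```python
def event_based_scores(reference, estimated, collar=0.2):
    """Event-based P, R, F and ER for one class.

    reference, estimated: lists of (onset, offset) tuples in seconds.
    collar: assumed onset tolerance in seconds (hypothetical default).
    """
    matched = set()  # indices of reference events already matched
    tp = 0
    for est_onset, _ in estimated:
        for i, (ref_onset, _) in enumerate(reference):
            # Greedy matching: each reference event is matched at most once.
            if i not in matched and abs(est_onset - ref_onset) <= collar:
                matched.add(i)
                tp += 1
                break

    fp = len(estimated) - tp  # insertions
    fn = len(reference) - tp  # deletions

    precision = tp / len(estimated) if estimated else 0.0
    recall = tp / len(reference) if reference else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    # With a single class there are no substitutions, so ER = (D + I) / N.
    error_rate = (fn + fp) / len(reference) if reference else 0.0
    return {"P": precision, "R": recall, "F": f_score, "ER": error_rate}


reference = [(0.0, 1.0), (2.0, 3.0), (5.0, 6.0)]
estimated = [(0.1, 1.0), (2.5, 3.0)]
scores = event_based_scores(reference, estimated)
```

Here only the first estimated event falls within the tolerance of a reference onset, so one estimate counts as an insertion and two reference events count as deletions. Segment-based metrics would instead compare activity on a fixed time grid, which is less sensitive to where an event starts.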

“…To process these diverse data, voice activity detection (VAD) is an essential preprocessing in detecting the presence or absence of human voice in clips. Classical applications of VAD include speaker diarization [1], music [2] and speech [3] signal processing.…”

Section: Introduction (mentioning)

confidence: 99%

“…The AV-VAD proposed in this paper should detect not only speech, but also the singing of anchor. Due to the different articulation and phonation between speaking and singing, the speech activity detector does not perform well with musical clips [2]. Authors in [13] attempt to detect singing voice and speech based on the same audio-based VAD model, without distinguishing whether the speech or singing voice comes from the anchor or background.…”

Section: Introduction (mentioning)

confidence: 99%

Rule-Embedded Network for Audio-Visual Voice Activity Detection in Live Musical Video Streams

Hou, Deng, Zhu, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Detecting an anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) primarily rely on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. This paper proposes a rule-embedded network to fuse the audio-visual (A-V) inputs for better detection of the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and use visual representations as a mask to filter out the information of non-target sound. Experiments show that: 1) with the help of cross-modal fusion using the proposed rule, the detection results of the A-V branch outperform those of the audio branch in the same model framework; 2) the performance of the bimodal A-V model far outperforms that of audio-only models, indicating that the incorporation of both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
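
The abstract does not spell out the rule itself, so the following is only a hypothetical sketch of the general idea of using visual evidence as a mask on audio detections. It assumes both branches emit one probability per frame and fuses them by element-wise multiplication; the function name `mask_audio_with_visual`, the multiplicative rule, and the 0.5 threshold are all assumptions for illustration, not the paper's method.

```python
def mask_audio_with_visual(audio_probs, visual_probs, threshold=0.5):
    """Gate per-frame audio voice probabilities with visual probabilities.

    audio_probs:  probability that any voice is present in each frame.
    visual_probs: probability that the on-screen anchor is vocalizing.
    Returns one boolean per frame: True if the target voice is detected.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("audio and visual streams must have equal length")
    # Assumed fusion rule: a frame only counts as target voice when both
    # modalities agree, so background voices are suppressed by the mask.
    fused = [a * v for a, v in zip(audio_probs, visual_probs)]
    return [p >= threshold for p in fused]


audio = [0.9, 0.8, 0.9, 0.1]   # frame 2: a voice is audible...
visual = [1.0, 0.2, 0.9, 0.9]  # ...but the anchor is probably not singing
decisions = mask_audio_with_visual(audio, visual)
```

In this toy input the second frame has a strong audio response but weak visual support, so the mask filters it out as non-target sound.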

“…To recognize speech and singing voices in these videos, voice activity detection (VAD) is a necessary preprocessing to identify the start and end time of human voice activities. VAD has attracted many interests due to its wide applications such as speech [1,2] and music information processing [3].…”

Section: Introduction (mentioning)

confidence: 99%

Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams

Hou, Yu, Liang, et al. 2021

Interspeech 2021

Self Cite

Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate information of audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity to shape the acoustic representations through the probability of anchor vocalization. Experiments show the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating the cross-modal information fusion based on semantic correlation is sensible and successful.
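
As a rough illustration of similarity-based fusion, the sketch below weights each audio frame by how similar its embedding is to the image embedding of the same frame. It assumes both embeddings live in a shared space of equal dimension and maps cosine similarity linearly to a weight in [0, 1]; the names `cosine` and `similarity_gate` and this particular weighting are hypothetical and are not taken from the paper, whose attention mechanism is not detailed in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_gate(audio_frames, image_frames):
    """Scale each audio embedding by its similarity to the image embedding.

    The weight (cos + 1) / 2 lies in [0, 1] and stands in for the
    probability that the anchor is vocalizing (an assumed simplification).
    """
    fused = []
    for audio_vec, image_vec in zip(audio_frames, image_frames):
        weight = (cosine(audio_vec, image_vec) + 1.0) / 2.0
        fused.append([weight * x for x in audio_vec])
    return fused


audio_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
image_frames = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
fused = similarity_gate(audio_frames, image_frames)
```

Frames whose audio and image embeddings agree pass through unchanged, opposed embeddings are suppressed to zero, and unrelated (orthogonal) embeddings are halved.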

Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music

Sun, Zhang, et al. 2022

Lecture Notes in Electrical Engineering
