2022
DOI: 10.1109/TASLP.2022.3169627

Self-Supervised Contrastive Learning for Singing Voices

Abstract: This study introduces self-supervised contrastive learning to acquire feature representations of singing voices. To acquire robust representations in an unsupervised manner, regular self-supervised contrastive learning trains neural networks to make the feature representation of a sample close to those of its computationally transformed versions. Similarly, we employ two transformations, pitch shifting and time stretching, considering the nature of singing voices. Nevertheless, we use them reversely: we train neural networks…
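The abstract's key twist is that pitch shifting and time stretching are used in the reverse of their usual contrastive role: transformed versions of a sample are pushed away from it as negatives rather than pulled toward it as positives. Below is a minimal sketch of what such a reversed InfoNCE-style objective could look like; the encoder, the batch layout, and the choice of another view of the same excerpt as the positive are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def reversed_contrastive_loss(anchor, positive, transformed, temperature=0.1):
    """InfoNCE-style loss in which `transformed` embeddings (pitch-shifted /
    time-stretched versions of the anchor excerpts) serve as explicit
    negatives, per the reversed use of transformations described above.

    anchor, positive, transformed: (batch, dim) embeddings from an encoder
    (the encoder and the positive-view choice are assumptions of this sketch).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    transformed = F.normalize(transformed, dim=-1)

    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    neg_sim = (anchor @ transformed.t()) / temperature                     # (B, B)

    # Column 0 holds the positive; every transformed view, including the
    # anchor's own pitch-shifted/time-stretched copy, is treated as a negative.
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```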

Cited by 12 publications (3 citation statements)
References: 53 publications
“…The proposed framework also utilizes semantic information by taking into account the existence or absence of particular sources in each excerpt. This diverges from a number of research works in the literature that utilize a specific source, such as the singing voice [17]-[19] or percussion [20, 21]. Application of our proposed methodology on top of COLA [7], a popular self-supervised audio representation learning framework, indicates that it can yield results competitive with a number of contrastive pair creation strategies [9, 10] in three different downstream tasks.…”
Section: Introduction
Mentioning confidence: 74%
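For context on COLA [7], which the quoted work builds on: COLA crops two segments from the same recording as a positive pair, treats all other in-batch segments as negatives, and scores pairs with a learned bilinear similarity. A minimal sketch of that base objective follows; the embedding dimension and training details are assumptions, and the source-presence-aware pair selection described in the quote is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class COLAObjective(nn.Module):
    """Sketch of COLA's bilinear contrastive objective over in-batch pairs."""

    def __init__(self, dim=512):  # embedding size is an assumption
        super().__init__()
        self.bilinear = nn.Linear(dim, dim, bias=False)

    def forward(self, seg_a, seg_b):
        # seg_a[i] and seg_b[i] are embeddings of two crops of clip i;
        # off-diagonal entries pair crops from different clips (negatives).
        logits = seg_a @ self.bilinear(seg_b).t()           # (B, B) similarities
        labels = torch.arange(seg_a.size(0), device=seg_a.device)
        return F.cross_entropy(logits, labels)              # diagonal = positives
```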
“…In [28], additive mixtures of audio excerpts are used in conjunction with their components as input pairs, while in [22], the authors expand on their previous work [28] by using an unsupervised source separation system [29] to extract separated views of the initial audio segment. In music signal processing, similarities between song excerpts and the corresponding vocals have been exploited, using triplet losses [17, 18] or batch-wise contrastive losses [19], usually for the task of artist identification. Furthermore, in [20] the anchor-positive pairs are created from the percussive part of a music segment and its non-percussive accompaniment, without temporal cropping, an idea that was expanded in [21].…”
Section: Related Work
Mentioning confidence: 99%
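As a concrete illustration of the song/vocals triplet setup the quote attributes to [17, 18]: an excerpt of the full mix can serve as the anchor, its isolated vocals as the positive, and vocals from a different track as the negative. The sketch below assumes a shared encoder has already produced the embeddings; the normalization and margin value are illustrative choices, not taken from those papers.

```python
import torch.nn.functional as F

def song_vocal_triplet_loss(mix_emb, own_vocal_emb, other_vocal_emb, margin=0.5):
    """Triplet margin loss pairing a full-mix excerpt (anchor) with its own
    vocals (positive) against vocals from another song (negative)."""
    mix_emb = F.normalize(mix_emb, dim=-1)
    own_vocal_emb = F.normalize(own_vocal_emb, dim=-1)
    other_vocal_emb = F.normalize(other_vocal_emb, dim=-1)
    return F.triplet_margin_loss(mix_emb, own_vocal_emb, other_vocal_emb,
                                 margin=margin)
```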
“…Nevertheless, the NMS community is typically overlooked in research focusing on the prevalent voice-based approach for person identification (PID). Voice-based PID studies predominantly focus on verbal vocalizations: speech for speaker identification [3]-[6] and singing voice [7]-[12] for singer identification. In contrast, nonverbal vocalizations, albeit from speaking individuals, remain largely unexplored for PID.…”
Section: Related Work
Mentioning confidence: 99%