ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682475
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

Cited by 228 publications (185 citation statements)
References 18 publications
“…Unsupervised learning has undergone major advances with the development of so-called self-supervised learning methods, which define application-specific proxy tasks to encourage neural networks to produce semantically structured representations. We propose a general unimodal and cross-modal representation learning technique based on the proxy task of coincidence prediction, which unifies recent work in audio-only [11] and audio-visual [12,13] self-supervised learning. The goal is to learn separate audio and image embeddings that can predict whether each sound-sound pair or each sound-image pair occurs within some prescribed temporal proximity in which semantic constituents are generally stable.…”
Section: Introduction
confidence: 92%
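The coincidence-prediction proxy task quoted above (separate audio and image embeddings trained to predict whether a sound-image pair occurs within a prescribed temporal window) can be sketched minimally. In the toy NumPy sketch below, random linear projections stand in for the audio and image subnetworks, and cosine similarity serves as the coincidence logit; all shapes and names are illustrative assumptions, not the cited architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the audio and image embedding subnetworks
# (hypothetical feature/embedding sizes, for illustration only).
W_audio = rng.normal(size=(128, 32))   # 128-d audio feature -> 32-d embedding
W_image = rng.normal(size=(256, 32))   # 256-d image feature -> 32-d embedding

def embed_audio(x):
    return x @ W_audio

def embed_image(x):
    return x @ W_image

def coincidence_score(a_feat, v_feat):
    """Cosine similarity used as the coincidence logit: higher means the
    pair is predicted to co-occur within the prescribed temporal window."""
    za, zv = embed_audio(a_feat), embed_image(v_feat)
    return float(za @ zv / (np.linalg.norm(za) * np.linalg.norm(zv)))

# In training, a pair drawn from the same clip within the temporal window
# would be labelled positive (1); a pair from different clips, negative (0).
audio_feat = rng.normal(size=128)
image_feat = rng.normal(size=256)
s = coincidence_score(audio_feat, image_feat)
```

With trained (rather than random) projections, positive pairs would receive higher scores than negatives, and a logistic loss on the score would drive the two embeddings toward a shared semantic structure.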
“…If none of the outputs corresponding to the K classes is above the threshold, the sample will be rejected. This MLP architecture is inspired by previous works, such as [15] or the baseline method used in [22]. This also allows for emphasizing the contribution of the proposed autoencoders, verifying their validity in a clearer way.…”
Section: Multi-layer Perceptron
confidence: 93%
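The rejection rule quoted above (reject a sample when none of the K class outputs clears a threshold) can be sketched as a softmax classifier head with a confidence threshold. This is a generic sketch of that decision rule, not the cited paper's exact MLP; the threshold value is an assumption:

```python
import numpy as np

def classify_with_rejection(logits, threshold=0.5):
    """Softmax over K class logits; return the argmax class, or -1
    (reject) when no class probability reaches the threshold."""
    z = np.asarray(logits, dtype=float)
    p = np.exp(z - z.max())   # shifted for numerical stability
    p /= p.sum()
    k = int(np.argmax(p))
    return k if p[k] >= threshold else -1

print(classify_with_rejection([4.0, 0.1, 0.2]))   # confident -> class 0
print(classify_with_rejection([0.4, 0.5, 0.45]))  # no class clears 0.5 -> -1
```

Raising the threshold trades coverage for precision: more uncertain samples are rejected rather than misclassified, which matters in open-set audio classification.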
“…This also allows for emphasizing the contribution of the proposed autoencoders, verifying their validity in a clearer way. In this context, the baseline method proposed in [22] uses audio embeddings obtained from the L3net network [15]. As a result, while the baseline employs transfer learning (the method relies on prior knowledge from a pre-trained network), we only make use of the samples available in the training dataset.…”
Section: Multi-layer Perceptron
confidence: 99%
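The transfer-learning setup this statement contrasts against (a frozen pre-trained embedding network, such as the L3-net audio embeddings, with only a lightweight classifier trained on the task's samples) can be sketched generically. The frozen "network" below is a random projection used purely for illustration; the dimensions and training loop are assumptions, not the baseline's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a frozen pre-trained embedding network (e.g. 512-d
# audio embeddings); the weights are random here and never updated.
W_frozen = rng.normal(size=(128, 512))

def pretrained_embedding(features):
    return np.tanh(features @ W_frozen)  # frozen: no gradient updates

def train_linear_classifier(X_emb, y, n_classes, lr=0.1, epochs=200):
    """Transfer learning step: only this linear softmax head is trained
    on the task-specific samples, on top of the frozen embeddings."""
    W = np.zeros((X_emb.shape[1], n_classes))
    for _ in range(epochs):
        z = X_emb @ W
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        onehot = np.eye(n_classes)[y]
        W -= lr * X_emb.T @ (p - onehot) / len(y)  # cross-entropy gradient
    return W

X = rng.normal(size=(40, 128))          # toy task-specific samples
y = rng.integers(0, 3, size=40)
emb = pretrained_embedding(X)
W = train_linear_classifier(emb, y, n_classes=3)
acc = np.mean(np.argmax(emb @ W, axis=1) == y)
```

Training from scratch, as the quoted authors do instead, would update the embedding network's weights as well, using only the training dataset and no prior knowledge.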
“…Multiple attempts have been made in the particular domain of ambient sound either relying only on audio [4][5][6] or co-training with visualization [7]. However, these works are unsupervised or exploit only a very limited amount of labeled data.…”
Section: Introduction
confidence: 99%