ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747867
Learning Sound Localization Better from Semantically Similar Samples

Cited by 25 publications (14 citation statements)
References 25 publications
“…Specifically, we train our model with 144k VGG Sound samples and test on the Flickr SoundNet test set. Comparing against [6,29,30], we significantly outperform all methods, as shown in Table 1, which shows our model is capable of generalizing well across datasets. We further investigate our method's robustness by testing on sound categories that are disjoint from what is seen during training.…”
Section: Methods (mentioning)
confidence: 84%
“…
Method              Training Set   cIoU 0.5   AUC
Attention [28]      Flickr 10k     0.436      0.449
CoarseToFine [25]   Flickr 10k     0.522      0.496
AVObject [1]        Flickr 10k     0.546      0.504
LVS* [6]            Flickr 10k     0.730      0.578
SSPL [30]           Flickr 10k     0.743      0.587
HTF (Ours)          Flickr 10k     0.860      0.634
Attention [28]      Flickr 144k    0.660      0.558
DMC [19]            Flickr 144k    0.671      0.568
LVS* [6]            Flickr 144k    0.702      0.588
LVS† [6]            Flickr 144k    0.697      0.560
HardPos [29]        Flickr 144k    0.762      0.597
SSPL [30]           Flickr 144k    0…

We similarly perform random cropping and horizontal flipping of the flow fields, which are performed consistently with the image augmentations. For audio, we sample 3 seconds of the video at 16 kHz and construct a log-scaled spectrogram using a bin size of 256, an FFT window of 512 samples, and a stride of 274 samples, resulting in a shape of 257 × 300.…”
Section: Methods (mentioning)
confidence: 99%
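The audio pipeline quoted above (3 s at 16 kHz, FFT window of 512 samples, stride of 274) can be sketched with a plain NumPy STFT. The windowing and padding conventions below are assumptions, not the citing paper's actual code: the frame count depends on them (the quoted 257 × 300 shape implies the paper's own padding/segmentation), but the 257 frequency bins follow directly from n_fft // 2 + 1.

```python
import numpy as np

def log_spectrogram(audio, n_fft=512, hop=274):
    """Log-magnitude spectrogram via a sliding Hann-windowed FFT.

    n_fft and hop follow the values quoted in the excerpt; the Hann
    window and no-padding framing are assumptions made for this sketch.
    """
    window = np.hanning(n_fft)
    # Number of full frames with no centering/padding.
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return np.log(spec + 1e-7).T                # (freq bins, time frames)

audio = np.random.randn(3 * 16000)  # 3 s at 16 kHz, as in the quote
spec = log_spectrogram(audio)
print(spec.shape)  # frequency axis is always 512 // 2 + 1 = 257 bins
```

With this no-padding convention the time axis comes out shorter than the quoted 300 frames; reproducing 257 × 300 exactly would require the citing paper's specific framing and padding choices.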
“…1. Research shows [4,16,29,36] that these false negatives will lead to contradictory objectives and harm the representation learning.…”
Section: Indicates Equal Contribution (mentioning)
confidence: 99%