2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.01421

Sound and Visual Representation Learning with Multiple Pretraining Tasks

Abstract: Different self-supervised learning (SSL) tasks reveal different features from the data. The learned feature representations can exhibit different performance for each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) in a way that generalizes well for all downstream tasks. For this study, we investigate binaural sounds and image data. For binaural sounds, we propose three SSL tasks, namely spatial alignment, temporal synchronization of foreground objects and binaural sounds, and temporal ga…

Citations: cited by 2 publications (1 citation statement)
References: 70 publications
“…• Video-based sound source localization (SSL) [5], [8], [20], [60], [65], [66], [85], [94], [95], [97], [103], [110], [131], [138], [139], [148], [151], [153], [162]-[166] involves marking pixels' correspondence to each sound source, such as vehicles, in video frames. When the source of sound is a person, we have the audiovisual speaker localization (AVSL) [23], [35] problem, which involves identifying and locating the speaker(s) in an audio-visual scene, such as identifying and locating a person speaking in a video and tracking the speaker [21], [22], [33].…”
Section: Core AV Tasks and Problem Contexts
Citation type: mentioning
confidence: 99%