2020
DOI: 10.48550/arxiv.2001.04758
Preprint

Deep Audio-Visual Learning: A Survey

Abstract: Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities either to improve the performance of previously considered single-modality tasks or to address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four differe…

citations
Cited by 14 publications
(12 citation statements)
references
References 129 publications
(206 reference statements)
“…They record the activities from different aspects, and cooperate together to help the viewer understand the video content. Recently, multimodal learning has proved that audio and vision modalities share a consistency space, and there are semantic relations between them [19], [20]. Lots of relevant video analysis tasks have demonstrated that the performance is promoted by utilizing the multimodal information in previous single modality tasks [21]-[23].…”
Section: A Motivation and Overview (mentioning)
confidence: 99%
“…Lastly, to enhance the quality of the coarse outputs and obtain fine-grained results, authors provided two-stage GAN network in [47]. For a detailed review of audio-image translation tasks, please refer to a recent survey [50].…”
Section: Related Work (mentioning)
confidence: 99%
“…The joint learning of both audio and visual information has received growing attention in recent years [53,19,15,35,23]. By leveraging data within the two modalities, researchers have shown success in learning audio-visual self-supervision [4,2,3,25,31,22], audio-visual speech recognition [21,39,48,45], localization [47,38,37,34], event localization (parsing) [41,43,40], audio-visual navigation [13,5], cross-modality generation between the two modalities [9,51,8,6,48,7,52,49,42,50] and so on.…”
Section: Joint Audio-visual Learning (mentioning)
confidence: 99%