2022
DOI: 10.48550/arxiv.2204.11420
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1

Citation Types

0
2
0

Publication Types

Select...
1

Relationship

0
1

Authors

Journals

citations
Cited by 1 publication
(2 citation statements)
references
References 0 publications
0
2
0
Order By: Relevance
“…The audiovisual embedding is finally classified into target categories by a Dense Layer. From results shown in recent papers [8], [9], [11], we can see that the visual data contributes to the AVSC performance more significantly than the audio data. If we train our proposed AVSC system with an endto-end training process, it possibly causes an overfitting on the visual branch and reduces the role of the audio branch.…”
Section: Proposed Deep Learning Based Multimodal For Avsc Taskmentioning
confidence: 76%
See 1 more Smart Citation
“…The audiovisual embedding is finally classified into target categories by a Dense Layer. From results shown in recent papers [8], [9], [11], we can see that the visual data contributes to the AVSC performance more significantly than the audio data. If we train our proposed AVSC system with an endto-end training process, it possibly causes an overfitting on the visual branch and reduces the role of the audio branch.…”
Section: Proposed Deep Learning Based Multimodal For Avsc Taskmentioning
confidence: 76%
“…Similar to the systems proposed for analyzing videos of human activities [5], [1], the state-of-the-art systems proposed for AVSC task also leveraged deep learning based models and presented joined audio-visual analysis. For instances, the proposed systems in [8], [9] used convolutional based models to extract audio embeddings from audio data and leveraged pre-trained deep learning models for extracting visual embeddings from visual data. Then, the audio embeddings and the visual embeddings are concatenated and fed into dense layers for classification.…”
Section: Introductionmentioning
confidence: 99%