ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9415065

Audio-Visual Event Recognition Through the Lens of Adversary

Abstract: As audio/visual classification models are widely deployed for sensitive tasks like content filtering at scale, it is critical to understand their robustness as well as to improve their accuracy. This work studies several key questions in multimodal learning through the lens of adversarial noise: 1) How does the trade-off between early/middle/late fusion affect robustness and accuracy? 2) How do different frequency/time-domain features contribute to robustness? 3) How do different neural modules…
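To make question 1 concrete, here is a minimal sketch contrasting early and late fusion of pre-extracted audio/visual features. The feature dimensions, module names, and logit-averaging rule are illustrative assumptions, not details taken from the paper; the point is only where the two modalities meet, which determines how far a perturbation in one stream can propagate.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    # Features are concatenated before a shared classifier, so a perturbation
    # in either modality reaches every downstream weight.
    def __init__(self, d_audio=128, d_video=128, n_classes=10):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(d_audio + d_video, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, a, v):
        return self.joint(torch.cat([a, v], dim=-1))

class LateFusion(nn.Module):
    # Each modality is classified independently and logits are averaged, so an
    # attack on one stream is diluted by the clean stream.
    def __init__(self, d_audio=128, d_video=128, n_classes=10):
        super().__init__()
        self.audio_head = nn.Linear(d_audio, n_classes)
        self.video_head = nn.Linear(d_video, n_classes)

    def forward(self, a, v):
        return 0.5 * (self.audio_head(a) + self.video_head(v))
```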

Cited by 4 publications (3 citation statements) · References: 19 publications
“…This section provides a detailed explanation of the proposed model, which consists of the following three components. (1) The convolution layer extracts low-level features and compresses the frequency axis via pooling; (2) the capsule layer is composed of PrimaryCaps and EventCaps. PrimaryCaps is essentially a convolutional layer that prepares inputs for EventCaps, whose output is a vector whose length represents the probability of an event; (3) the recurrent layer learns temporal context and estimates the likelihood of event activity.…”
Section: B. Classification Model
confidence: 99%
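The three components quoted above map naturally onto a convolutional-recurrent network with a capsule stage in the middle. The following sketch is an illustrative reconstruction under assumptions: the layer sizes, mel-bin count, and the use of a plain linear transform plus squashing in place of full dynamic routing between PrimaryCaps and EventCaps are choices of this sketch, not of the cited paper.

```python
import torch
import torch.nn as nn

def squash(s, dim=-1):
    # Capsule nonlinearity: output vector length lies in (0, 1) and
    # encodes the probability that the capsule's entity is present.
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + 1e-8)

class CapsuleCRNN(nn.Module):
    """Sketch of the quoted three-stage model:
    conv front-end -> PrimaryCaps/EventCaps -> recurrent layer."""
    def __init__(self, n_mels=64, n_events=10, caps_dim=8, event_dim=16):
        super().__init__()
        # (1) Convolution: low-level features; pooling compresses frequency only.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),                 # freq /4, time preserved
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),                 # freq /16 overall
        )
        # (2a) PrimaryCaps: a convolution whose channels form 8 capsule groups.
        self.primary = nn.Conv2d(64, 8 * caps_dim, 3, padding=1)
        self.caps_dim = caps_dim
        # (2b) EventCaps: one capsule per event; simplified here to a linear
        # transform + squash instead of iterative dynamic routing.
        freq_bins = n_mels // 16
        self.event = nn.Linear(8 * freq_bins * caps_dim, n_events * event_dim)
        self.n_events, self.event_dim = n_events, event_dim
        # (3) Recurrent layer: temporal context over per-frame event scores.
        self.gru = nn.GRU(n_events, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(128, n_events)

    def forward(self, x):                         # x: (B, 1, n_mels, T)
        h = self.conv(x)                          # (B, 64, n_mels/16, T)
        p = self.primary(h)                       # (B, 8*caps_dim, F, T)
        B, _, Fq, T = p.shape
        p = p.view(B, 8, self.caps_dim, Fq, T).permute(0, 4, 1, 3, 2)
        p = squash(p.reshape(B, T, -1, self.caps_dim))   # (B, T, 8*F, caps_dim)
        e = self.event(p.flatten(2))              # (B, T, n_events*event_dim)
        e = squash(e.view(B, T, self.n_events, self.event_dim))
        probs = e.norm(dim=-1)                    # capsule length ~ event probability
        ctx, _ = self.gru(probs)                  # temporal context over frames
        return torch.sigmoid(self.head(ctx))      # per-frame event activity
```

A spectrogram batch of shape (batch, 1, 64, frames) yields per-frame event activities of shape (batch, frames, n_events), matching the description of the recurrent layer estimating event activity over time.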
“…In contrast, surveillance systems built on audio analysis are unaffected by changes in lighting and have no blind spots. By using just one mono microphone and one camera to integrate audio and visual data into scene analysis, the detection ability of automatic surveillance systems can be enhanced [1]. In recent years, urban environmental sound detection has attracted increasing attention and has been applied to a variety of devices, such as audio surveillance devices [2], healthcare monitoring devices [3], [4], urban sound analytics devices [5], and smart home devices [6].…”
Section: Introduction
confidence: 99%
“…Takeaways: smaller features yield a 6× efficiency gain at the cost of 2% mAP; models combining local and global information are efficient and give the best performance; attention-based models have more parameters, are harder and more sensitive to train with a sharper loss landscape, but are more robust to noise [25].…”
confidence: 99%
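The robustness claim in that last takeaway is typically probed with gradient-based perturbations. Below is a generic one-step FGSM probe against a single modality of a fusion model such as the sketches above; FGSM is a standard technique, not necessarily the exact attack protocol of the paper, and the epsilon value and the choice to attack the audio stream are arbitrary choices of this sketch.

```python
import torch
import torch.nn.functional as F

def fgsm_audio(model, x_audio, x_video, y, eps=0.03):
    """One-step FGSM on the audio stream only, keeping video clean.
    Comparing accuracy on (x_adv, x_video) vs. (x_audio, x_video) shows
    how much a fusion model leans on the perturbed modality."""
    x = x_audio.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x, x_video), y)
    loss.backward()
    # Step in the gradient-sign direction that maximally increases the loss.
    return (x + eps * x.grad.sign()).detach()
```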