Audio-Visual Event Recognition Through the Lens of Adversary

Li, Juncheng B; Ma, Kaixin; Qu, Shuhui; Huang, Po-Yao; Metze, Florian

doi:10.1109/icassp39728.2021.9415065

Cited by 4 publications

(3 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…This section provides a detailed explanation of the proposed model, which consists of the following three components. (1) The convolution layer is used to extract low-level features and compress the frequency axis via pooling; (2) the capsule layer is composed of PrimaryCaps and EventCaps. The essence of PrimaryCaps is a convolutional layer, which is mainly used to prepare for EventCaps, and the output of EventCaps is a vector whose size represents the probability of events; (3) the recurrent layer is employed to study temporal context data and reckon the likelihood of event activity.…”

Section: B Classification Modelmentioning

confidence: 99%

“…In contrast, surveillance systems developed on the basis of audio analysis are not affected by changes in lighting and they do not have blind spots. By using only one mono microphone and one camera to integrate visual and audio data into the scene analysis, automatic surveillance systems' detection ability can be enhanced [1]. In recent years, urban environment sound detection has attracted increasingly more attention, and has been applied to various devices, such as audio surveillance devices [2], healthcare monitoring devices [3], [4], urban sound analytics devices [5], and smart home devices [6].…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

A System for the Detection of Polyphonic Sound on a University Campus Based on CapsNet-RNN

et al. 2021

View full text Add to dashboard Cite

In recent decades, surveillance and home security systems based on video analysis have been proposed for the automatic detection of abnormal situations. Nevertheless, in several real applications, it may be easier to detect a given event from audio information, and the use of audio surveillance systems can greatly improve the robustness and reliability of event detection. In this paper, a novel system for the detection of polyphonic urban noise is proposed for on-campus audio surveillance. The system aggregates different acoustic features to improve the classification accuracy of urban noise. A combination model composed of a capsule neural network (CapsNet) and recurrent neural network (RNN) is employed as the classifier. CapsNet overcomes some limitations of convolutional neural networks (CNNs), such as the loss of position information after max-pooling, and the RNN mainly models the temporal dependency of context information. The combination of these networks further improves the accuracy and robustness of polyphonic sound events detection. Moreover, a monitoring platform is designed to visualize noise maps and acoustic event information. The deployment architecture of the system is used in real environments, and experiments were also conducted on two public datasets. The results demonstrate that the proposed method is superior to existing state-of-art methods for the polyphonic sound event detection task.

show abstract

Section: B Classification Modelmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

A System for the Detection of Polyphonic Sound on a University Campus Based on CapsNet-RNN

et al. 2021

View full text Add to dashboard Cite

show abstract

“…Takeaways: Smaller features: 6× efficiency VS. 2% mAP loss. Models with Local+Global info → efficient & best Performance Attention-based models: More # params, harder/more sensitive to train, sharp loss landscape, but more robust to noise[25] …”

mentioning

confidence: 99%

AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification

Li¹,

Qu²,

Po-Yao³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

After its sweeping success in vision and language tasks, pure attention-based neural architectures (e.g. DeiT) [1] are emerging to the top of audio tagging (AT) leaderboards [2], which seemingly obsoletes traditional convolutional neural networks (CNNs), feed-forward networks or recurrent networks. However, taking a closer look, there is great variability in published research, for instance, performances of models initialized with pretrained weights differ drastically from without pretraining [2], training time for a model varies from hours to weeks, and often, essences are hidden in seemingly trivial details.This urgently calls for a comprehensive study since our 1st comparison [3] is half-decade old. In this work, we perform extensive experiments on AudioSet [4] which is the largest weakly-labeled sound event dataset available, we also did analysis based on the data quality and efficiency. We compare a few state-of-the-art baselines on the AT task, and study the performance and efficiency of 2 major categories of neural architectures: CNN variants and attention-based variants. We also closely examine their optimization procedures. Our opensourced experimental results 1 provide insights to trade off between performance, efficiency, optimization process, for both practitioners and researchers. 2

show abstract

Employing multimodal co-learning to evaluate the robustness of sensor fusion for industry 5.0 tasks

et al. 2022

View full text Add to dashboard Cite

Audio-Visual Event Recognition Through the Lens of Adversary

Cited by 4 publications

References 19 publications

A System for the Detection of Polyphonic Sound on a University Campus Based on CapsNet-RNN

A System for the Detection of Polyphonic Sound on a University Campus Based on CapsNet-RNN

AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification

Employing multimodal co-learning to evaluate the robustness of sensor fusion for industry 5.0 tasks

Contact Info

Product

Resources

About