Relation-guided acoustic scene classification aided with event embeddings

Hou, Yuanbo; Kang, Bo; Hauwermeiren, Wout Van; Botteldooren, Dick

doi:10.1109/ijcnn55064.2022.9892893

Cited by 15 publications

(19 citation statements)

References 25 publications

“…However, real-life acoustic scenes and audio events naturally have implicit relationships with each other, and these relationships between scenes and events are not fully explored and used in Framework2. To this end, we recently proposed a new Relation-Guided ASC (RGASC) model to further exploit and coordinate the scene-event relation for the mutual benefit of scene and event recognition [19].…”

Section: Collaborative Acoustic Scene and Event Classification (mentioning)

confidence: 99%

“…Inspired by the idea of RGASC [19], to jointly classify the auditory scene and label sound events, the collaborative scene-event classification (CSEC) framework is introduced. It uniquely extends current practice models by introducing a learnable coupling matrix between a scene classification branch that solely relies on basic acoustic features and an event identification branch that solely relies on acoustic features, to assist the acoustic scene classification.…”

Section: Collaborative Acoustic Scene and Event Classification (mentioning)

confidence: 99%
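The coupling idea in the statement above can be illustrated with a minimal sketch. It is not the CSEC or RGASC implementation: the function names, the additive way event evidence is combined with scene logits, and the toy numbers are all assumptions made for illustration. In a real model the coupling matrix would be learned during training rather than written by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def couple_scene_with_events(scene_logits, event_probs, coupling):
    """Adjust scene logits with event evidence (hypothetical formulation).

    coupling[s][e] is an assumed weight describing how strongly event e
    supports (positive) or contradicts (negative) scene s.
    """
    if len(coupling) != len(scene_logits):
        raise ValueError("coupling needs one row per scene class")
    adjusted = []
    for s_logit, row in zip(scene_logits, coupling):
        if len(row) != len(event_probs):
            raise ValueError("each coupling row needs one weight per event class")
        adjusted.append(s_logit + sum(w * p for w, p in zip(row, event_probs)))
    return softmax(adjusted)


# Toy example: scenes = [park, office], events = [bird song, keyboard].
scene_logits = [0.0, 0.0]      # scene branch alone is undecided
event_probs = [0.9, 0.1]       # event branch hears mostly bird song
coupling = [[2.0, -1.0],       # park: supported by birds, not by keyboards
            [-1.0, 2.0]]       # office: the reverse
scene_probs = couple_scene_with_events(scene_logits, event_probs, coupling)
```

With these toy values the event evidence tips an otherwise undecided scene branch towards the park class.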

Artificial intelligence-based collaborative acoustic scene and event classification to support urban soundscape analysis and classification

Hou; Botteldooren (2023), Inter-Noise

A human listener embedded in a sonic environment will rely on meaning given to sound events as well as on general acoustic features to analyse and appraise its soundscape. However, currently used measurable indicators for soundscape mainly focus on the latter and meaning is only included indirectly. Yet, today's artificial intelligence (AI) techniques allow to recognise a variety of sounds and thus assign meaning to them. Hence, we propose to combine a model for acoustic event classification trained on the large-scale environmental sound database AudioSet, with a scene classification algorithm that couples direct identification of acoustic features with these recognised sound for scene recognition. The combined model is trained on TUT2018, a database containing ten everyday scenes. Applying the resulting AI-model to the soundscapes of the world database without further training shows that the classification that is obtained correlates to perceived calmness and liveliness evaluated by a test panel. It also allows to unravel why an acoustic environment sounds like a lively square or a calm park by analysing the type of sounds and their occurrence pattern over time. Moreover, disturbance of the acoustic environment that is expected based on visual clues, by e.g. traffic can easily be recognised.

“…The encoder layers are followed by a linear embedding layer with ReLU activation that maps the high-level representations of audio events to labels for classification. As the audio branch performs multilabel classification, binary cross-entropy (BCE) loss is used [9]. Denote the output of audio branch as $\hat{y}_e \in \mathbb{R}^{C_e}$, and the corresponding label as $y_e \in \mathbb{R}^{C_e}$, the loss can be defined as:…”

Section: A. The Audio Branch (mentioning)

confidence: 99%
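The statement above breaks off before the loss formula, so the citing paper's exact definition is not shown here. For orientation only, the sketch below implements the standard textbook form of binary cross-entropy for a multilabel output. Averaging over the event classes (rather than summing), the clipping constant `eps`, and the function name are assumptions, not details taken from the paper.

```python
import math


def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Standard BCE for one multilabel prediction.

    y_pred: per-class probabilities in [0, 1] (e.g. sigmoid outputs)
    y_true: per-class binary labels (0 or 1)
    Returns the mean loss over classes (assumed reduction).
    """
    if len(y_pred) != len(y_true):
        raise ValueError("prediction and label must have the same length")
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_pred)


# Three event classes, two of them active in the clip.
labels = [1, 0, 1]
good = binary_cross_entropy([0.9, 0.1, 0.8], labels)
bad = binary_cross_entropy([0.2, 0.7, 0.3], labels)
```

Because each class is scored independently, several events can be active in the same clip, which is what distinguishes this multilabel setting from single-label scene classification.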

“…Then, the event and object embeddings are concatenated together to form audio-visual semantic embeddings, and the fusion layer with ReLU activation maps the audio-visual embeddings into scene classes. As scene classification performs single-label multiclass classification, cross-entropy loss [9] is used between the output $\hat{y}_s \in \mathbb{R}^{C_s}$ and the scene label $y_s \in \mathbb{R}^{C_s}$,…”

Section: Semantic-based Fusion (SF) (mentioning)

confidence: 99%
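The fusion step described in the statement above (concatenate, apply a ReLU layer, score with cross-entropy) can be sketched as follows. This is a toy illustration under stated assumptions, not the citing paper's code: the function names, the single dense layer, the softmax inside the loss, and the hand-written weights are all hypothetical.

```python
import math


def fuse(event_emb, object_emb, weights, bias):
    """Concatenate two embeddings and apply one dense layer with ReLU.

    weights has one row per scene class and one column per concatenated
    feature; both weights and bias are assumed to be learned elsewhere.
    """
    x = list(event_emb) + list(object_emb)  # concatenation
    out = []
    for row, b in zip(weights, bias):
        if len(row) != len(x):
            raise ValueError("weight row length must match fused embedding")
        z = sum(w * v for w, v in zip(row, x)) + b
        out.append(max(0.0, z))  # ReLU
    return out


def cross_entropy(scores, target_index):
    """Softmax cross-entropy for one single-label multiclass example."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[target_index]


# Toy example: 1-d event embedding, 1-d object embedding, two scene classes.
scores = fuse([1.0], [2.0], [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
loss = cross_entropy(scores, target_index=0)
```

Unlike the multilabel event branch, exactly one scene class is correct per clip, so the loss only looks at the score assigned to the true class relative to all others.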

“…For example, in a park scene, birds singing and dogs barking are more likely to occur than keyboard sounds, where the later are often found in the office scenes. To exploit the inherent relationships between the coarse-grained scenes and corresponding fine-grained events, relation-guided ASC [9] coordinates scene-event relationships for the mutual benefit of scene and event recognition. For ISC models [10] [11], the input is usually an image or image sequence [12], and then the scene is recognized based on the rich object information, spatial layout information, as well as the relationship between the objects and layouts.…”

Section: Introduction (mentioning)

confidence: 99%

Audio-visual scene classification via contrastive event-object alignment and semantic-based fusion

Hou; Kang; Botteldooren (2022), 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP)

Self Cite

Previous works on scene classification are mainly based on audio or visual signals, while humans perceive the environmental scenes through multiple senses. Recent studies on audio-visual scene classification separately fine-tune the large-scale audio and image pre-trained models on the target dataset, then either fuse the intermediate representations of the audio model and the visual model, or fuse the coarse-grained decision of both models at the clip level. Such methods ignore the detailed audio events and visual objects in audio-visual scenes (AVS), while humans often identify a scene through both audio events and visual objects within, and the congruence between them. To exploit the fine-grained information of audio events and visual objects in AVS, and coordinate the implicit relationship between audio events and visual objects, this paper proposes a multibranch model equipped with contrastive event-object alignment (CEOA) and semantic-based fusion (SF) for AVSC. CEOA aims to align the learned embeddings of audio events and visual objects by comparing the difference between audio-visual event-object pairs. Then, visual objects associated with certain audio events and vice versa are accentuated by cross-attention and undergo SF for semantic-level fusion. Experiments show that: 1) the proposed AVSC model equipped with CEOA and SF outperforms the results of audio-only and visual-only models, i.e., the audio-visual results are better than the results from a single modality. 2) CEOA aligns the embeddings of audio events and related visual objects on a fine-grained level, and the SF effectively integrates both; 3) Compared with other large-scale integrated systems, the proposed model shows competitive performance, even without using additional datasets and data augmentation tricks.

Lightweight deep neural networks for acoustic scene classification and an effective visualization for presenting sound scene contexts

et al. 2023
