In surveillance applications, humans and vehicles are the most commonly studied elements. Consequently, detecting and matching a person or a car that appears in several videos is a key problem. Many algorithms have been introduced, and a major related problem today is to precisely evaluate and compare these algorithms against a common ground truth. In this paper, our goal is to introduce a new dataset for evaluating multi-view methods. This dataset aims to pave the way for multidisciplinary approaches and applications such as 4D scene reconstruction, object identification/tracking, audio event detection, and multi-source metadata modeling and querying. To this end, we provide two sets of 25 synchronized videos with audio tracks, all depicting the same scene from multiple viewpoints; each set follows a detailed scenario consisting of the comings and goings of people and cars. Every video was annotated by regularly drawing bounding boxes around every moving object, each with a flag indicating whether the object is fully visible or occluded, its category (human or vehicle), visual details (for example, clothing types or colors), and the timestamps of its appearances and disappearances. Audio events are likewise annotated with a category and timestamps.
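To make the annotation structure concrete, the following is a minimal sketch of how such records could be represented in Python. All class and field names (BoundingBox, ObjectAnnotation, AudioEvent, and their attributes) are hypothetical illustrations chosen for this sketch; they are not the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BoundingBox:
    """One bounding box drawn on a single frame (hypothetical fields)."""
    frame: int       # frame index within the video
    x: int           # top-left corner, in pixels
    y: int
    width: int
    height: int
    occluded: bool   # True if the object is partially or fully hidden

@dataclass
class ObjectAnnotation:
    """Annotation track of one moving object in one video (hypothetical schema)."""
    object_id: int
    category: str                   # "human" or "vehicle"
    details: List[str]              # visual details, e.g. ["red jacket", "jeans"]
    boxes: List[BoundingBox] = field(default_factory=list)
    # (appearance, disappearance) timestamps in seconds, one pair per interval
    intervals: List[Tuple[float, float]] = field(default_factory=list)

@dataclass
class AudioEvent:
    """One annotated audio event (hypothetical schema)."""
    category: str   # e.g. "car engine", "door slam"
    start: float    # start timestamp, in seconds
    end: float      # end timestamp, in seconds
```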