Improving faster-than-real-time human acoustic event detection by saliency-maximized audio visualization

Lin, Kun‐Wei; Zhuang, Xinhua; Goudeseune, Camille; King, Sarah; Hasegawa‐Johnson, Mark; Huang, Thomas S.

doi:10.1109/icassp.2012.6288368

Cited by 12 publications

(11 citation statements)

References 8 publications


“…We speculate that one possible solution to mitigate confusion errors would be to provide example recordings of sound classes to which annotators could refer while annotating. It is also possible that saliency maximization techniques such as the one proposed by Lin et al [27] could help reduce missed detection of events.…”

Section: Discussion (mentioning)

confidence: 99%


Seeing Sound

Cartwright, Seals, Salamon, et al. 2017. Proc. ACM Hum.-Comput. Interact.


Audio annotation is key to developing machine-listening systems; yet effective ways to accurately and rapidly obtain crowdsourced audio annotations are understudied. In this work, we seek to quantify the reliability/redundancy trade-off in crowdsourced soundscape annotation, investigate how visualizations affect accuracy and efficiency, and characterize how performance varies as a function of audio characteristics. Using a controlled experiment, we varied sound visualizations and the complexity of soundscapes presented to human annotators. Results show that more complex audio scenes result in lower annotator agreement, and spectrogram visualizations are superior in producing higher quality annotations at lower cost of time and human labor. We also found recall is more affected than precision by soundscape complexity, and mistakes can often be attributed to certain sound event characteristics. These findings have implications not only for how we should design annotation tasks and interfaces for audio data, but also how we train and evaluate machine-listening systems.


“…Lin et al [27] developed a saliency-maximized audio spectrogram to enable fast detection of sound events by human annotators. They then conducted a study on the effect of this alternative representation on audio annotation quality.…”

Section: Related Work (mentioning)

confidence: 99%

Seeing Sound

Cartwright, Seals, Salamon, et al. 2017. Proc. ACM Hum.-Comput. Interact.


“…In an AED task, the user is not permitted to observe Y[n_1, n_2] directly; instead, he or she must observe X[n_1, n_2], the spectrogram of the mixed noisy signal. The background noise with spectrogram N[n_1, n_2] is irrelevant to the task (e.g., symphony music [Hasegawa-Johnson et al 2011] or speech [Lin et al 2012]). In order to help the user correctly identify the locations at which the target signal Y[n_1, n_2] is nonzero, we propose to transform the image prior to display, using a learned image…”

Section: Saliency-maximized Audio Visualization (mentioning)

confidence: 99%

Saliency-maximized audio visualization and efficient audio-visual browsing for faster-than-real-time human acoustic event detection

Lin, Zhuang, Goudeseune, et al. 2013. ACM Trans. Appl. Percept. (self-citation)


Examining large audio archives is a challenging task for humans owing to the limitations of human audition. We explore an innovative approach to engage both human vision and audition for audio browsing, which significantly improves human acoustic event detection in long audio recordings. In particular we visualize the data as a saliency-maximized spectrogram, accessed at different temporal scales using a special audio browser that also allows rapid zooming across scales from hours to milliseconds. The saliency-maximized audio spectrogram lets humans quickly search for and detect events in audio recordings. By rendering target events as visually salient patterns, this representation minimizes the time and effort needed to visually examine a recording. This transformation maximizes the mutual information between the spectrogram of an isolated target event and the estimated saliency of the overall visual representation. When subjects are shown spectrograms that are saliency-maximized instead of the original spectrograms, they perform significantly better in a 1/10-real-time acoustic event detection task.


“…where i and j are the row and column pixel number of the spectrogram image, respectively, and J is the total number of column pixels, calculate the local image saliency by using equation (6).…”

Section: The Local Saliency Feature of MFCC (mentioning)

confidence: 99%

“…Though the sound recognition work has been proved to be efficient by using the features mentioned above, however, these features are not visualized features which could be extracted automatically and complex process algorithm is needed. Some research work has been done recently by fusing both audio and visual signal information to do the recognition and perception work for robot or other platforms [6] [7] , but the image feature for fusion they used is the entire image, therefore, the fusion processing needs a lot of computing resource and the image features of the sound signal are still not saliency features.…”

mentioning

confidence: 99%

A visualized acoustic saliency feature extraction method for environment sound signal processing

Wang, Zhang, Madani, et al. 2013. 2013 IEEE International Conference of IEEE Region 10 (TENCON 2013)


Environment perception is an important research issue for both unmanned ground vehicles and robots. To improve perception capacity, this paper proposes a visualized acoustic saliency feature extraction (VASFE) method for environment sound signal processing, based on both the short-time Fourier transform (STFT) and the Mel-Frequency Cepstrum Coefficient (MFCC). The sound signal is visualized using the STFT algorithm as a local image feature, and the MFCC represents the local acoustic feature of the signal. The proposed VASFE method is tested on natural sound data collected from real-world indoor and outdoor environments. The results show that the method extracts the saliency features of both long-term and short-term sound signals successfully and clearly, and yields highly distinguishable features for future processing of environment sound information.
