Interspeech 2021
DOI: 10.21437/interspeech.2021-1969
Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild

Abstract: We study deep bioacoustic event detection through multi-head attention based pooling, exemplified by wildlife monitoring. In the multiple instance learning framework, a core deep neural network learns a projection of the input acoustic signal into a sequence of embeddings, each representing a segment of the input. Sequence pooling is then required to aggregate the information present in the sequence such that we have a single clip-wise representation. We propose an improvement based on Squeeze-and-Excitation m…

Cited by 4 publications (21 citation statements)
References 34 publications
“…(see figure 2 for example spectrograms). Secondly, while the gibbon-focused studies in [33] and [22] suggest that in such cases we should typically use smaller and shallower model architectures in order to avoid overfitting on the limited training set, Rizos et al. [34] instead showed that it was the deeper and more complex models that achieved the highest performance (compared to models like the ones used in [22,33]), including the one we use in this study.…”
Section: (ii) Model Validation
Confidence: 92%
“…The classifier used in this study for whinny detection was first proposed in [34] as an improvement upon a deep, convolution-based neural network architecture for acoustic event detection [61]. This improvement was achieved via the addition of attention-like mechanisms [62] that learn to apply importance weights to the learnt features, as described in detail in [34]. Specifically, the model uses a squeeze-and-excitation mechanism [62] after each convolutional layer to reweight the outputs of the convolutional filters, as well as a multi-head attention mechanism [63] for pooling the sequential audio representation into a single, fixed-length vector representation.…”
Section: Methods
Confidence: 99%
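The two mechanisms named in the Methods citation — squeeze-and-excitation reweighting of convolutional channels, and multi-head attention pooling of a sequence of segment embeddings into one clip-level vector — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: all weight matrices and head scoring vectors here are random placeholders, and the single-vector-per-head scoring is one simple parameterisation of attentive pooling.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(features, w1, w2):
    """SE-style channel reweighting. features: (T, C) time-by-channel map."""
    z = features.mean(axis=0)           # squeeze: global average over time, (C,)
    gates = sigmoid(w2 @ relu(w1 @ z))  # excite: bottleneck MLP -> gates in (0, 1), (C,)
    return features * gates             # reweight each channel, broadcast over time

def multihead_attention_pool(embeddings, head_weights):
    """Pool a (T, d) embedding sequence into a single (H*d,) clip vector."""
    pooled = []
    for w in head_weights:              # one scoring vector (d,) per head (placeholder)
        scores = embeddings @ w         # (T,) relevance score per segment
        scores -= scores.max()          # numerical stability before softmax
        attn = np.exp(scores) / np.exp(scores).sum()
        pooled.append(attn @ embeddings)  # attention-weighted mean, (d,)
    return np.concatenate(pooled)       # concatenate the H head outputs

rng = np.random.default_rng(0)
T, C, H, r = 12, 16, 4, 4               # segments, channels, heads, SE bottleneck
feats = rng.normal(size=(T, C))
w1 = rng.normal(size=(r, C))
w2 = rng.normal(size=(C, r))
gated = squeeze_excite(feats, w1, w2)
heads = [rng.normal(size=C) for _ in range(H)]
clip_vec = multihead_attention_pool(gated, heads)
print(gated.shape, clip_vec.shape)      # (12, 16) (64,)
```

In the cited model these operations sit inside a deep convolutional network; here they are applied directly to a random feature map only to make the tensor shapes and the pooling step concrete.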