2019
DOI: 10.1186/s13636-019-0152-1
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

Abstract: Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dat…

Cited by 34 publications (13 citation statements)
References 26 publications
“…A similar neural architecture is used in [36] to implement a noise-robust vowel-based SAD. A more recent paper combines both speech and music detection using recurrent LSTM networks [37]. Our latest work in SAD, used in the first DIHARD diarisation challenge [38] and the Albayzín 2018 diarisation challenge [39], is also based on a BLSTM classifier.…”
Section: Neural Network in Audio Segmentation
Mentioning confidence: 99%
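A BLSTM-based SAD classifier such as the one cited above typically emits a per-frame speech posterior; downstream systems then convert those posteriors into speech segments. The sketch below shows one common post-processing step (simple thresholding and run-merging); it is an illustrative assumption, not the post-processing used in [38] or [39].

```python
def posteriors_to_segments(posteriors, threshold=0.5):
    """Threshold per-frame speech posteriors and merge consecutive
    speech frames into (start_frame, end_frame) pairs, end exclusive.

    Illustrative post-processing sketch; the cited SAD systems may use
    different smoothing (e.g. median filtering or HMM decoding).
    """
    segments = []
    start = None
    for i, p in enumerate(posteriors):
        if p >= threshold and start is None:
            start = i                      # speech run begins
        elif p < threshold and start is not None:
            segments.append((start, i))    # speech run ends
            start = None
    if start is not None:                  # run extends to the last frame
        segments.append((start, len(posteriors)))
    return segments

print(posteriors_to_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.6]))
# → [(1, 3), (4, 6)]
```

Frame indices can be mapped to seconds by multiplying by the frame shift of the feature extractor.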
“…While some of them are focused on the retrieval of information from specific kinds of acoustic signals, such as automatic speech recognition [1], [2], language or speaker identification [3], [4] (for speech signals) or music information retrieval [5], [6] (for musical signals), other tasks aim to determine the categories to which an audio recording belongs, among a set of target classes (e.g. human voice, vehicle, musical instruments) [7]. These categories can either refer to different environments where a recording can be obtained (e.g.…”
Section: Introduction
Mentioning confidence: 99%
“…MIL methods usually consist of two parts, a dynamic predictor for generating the present probability of the specific event in each frame and a pooling function for aggregating frame-level probabilities to a clip-level prediction. For the dynamic predictor, conventional support vector machine (SVM) [14], Gaussian mixture model (GMM) [15], and neural network approaches [16][17][18][19] are employed to perform prediction for each event class. The pooling function is used to reduce the dimension of the dynamic feature space, which has a great impact on the overall performance of the weakly supervised SED system.…”
Section: Introduction
Mentioning confidence: 99%
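The two-part MIL structure described above — a dynamic predictor producing per-frame event probabilities and a pooling function aggregating them into a clip-level prediction — can be sketched with a few common pooling choices. The function names are illustrative, not taken from the cited papers.

```python
# MIL pooling sketch for weakly supervised sound event detection.
# A dynamic predictor (SVM, GMM, or neural network) yields a probability
# p_t per frame; pooling reduces the sequence to one clip-level score.

def max_pooling(frame_probs):
    """Clip score = probability of the single most confident frame."""
    return max(frame_probs)

def average_pooling(frame_probs):
    """Clip score = mean over all frame probabilities."""
    return sum(frame_probs) / len(frame_probs)

def linear_softmax_pooling(frame_probs):
    """Frames weighted by their own probability, interpolating
    between max and average pooling."""
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0
    return sum(p * p for p in frame_probs) / total

frame_probs = [0.1, 0.9, 0.2, 0.8]  # e.g. per-frame speech probabilities
print(round(max_pooling(frame_probs), 3))             # → 0.9
print(round(average_pooling(frame_probs), 3))         # → 0.5
print(round(linear_softmax_pooling(frame_probs), 3))  # → 0.75
```

The choice of pooling matters because gradients flow back to frames differently: max pooling trains only the most confident frame per clip, while average pooling spreads the clip label over every frame, including silent ones.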