ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8682532

Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation

Cited by 14 publications (32 citation statements)
References 17 publications
“…We performed both architecture search as well as hyperparameter tuning for determining the best-performing model architecture and the number of convolutional blocks and hidden layer dimensions for recurrent and fully connected layers therein. The CNN-based architectures include standard CNN, CNN-GAP, CLDNN, and CNN-TD models [27]. The difference in these architectures is in the handling of the final output of the convolutional layers.…”
Section: Neural Network Architectures (mentioning)
confidence: 99%
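The statement above says the architectures differ in how they handle the final output of the convolutional layers. A minimal NumPy sketch of three such strategies follows. The mapping of each function to a named model is an assumption inferred from the model names alone (flattening for a plain CNN, global average pooling for CNN-GAP, a per-time-step dense layer for CNN-TD), and the shapes, function names, and weights are hypothetical, not taken from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last convolutional block for one audio segment,
# laid out as (time_steps, freq_bins, channels). Sizes are illustrative only.
T, F, C = 8, 4, 16
conv_out = rng.standard_normal((T, F, C))

def flatten_head(x):
    """Flatten the whole feature map into one vector (assumed plain-CNN style)."""
    return x.reshape(-1)

def gap_head(x):
    """Global average pooling over time and frequency, keeping the channel axis."""
    return x.mean(axis=(0, 1))

def time_distributed_head(x, weights):
    """Apply the same dense projection to every time step independently."""
    per_step = x.reshape(x.shape[0], -1)  # (T, F * C)
    return per_step @ weights             # (T, hidden)

hidden = 32  # hypothetical hidden size
W = rng.standard_normal((F * C, hidden))

flat = flatten_head(conv_out)             # one vector per segment
gap = gap_head(conv_out)                  # one value per channel
td = time_distributed_head(conv_out, W)   # one vector per time step

print(flat.shape, gap.shape, td.shape)
```

Under these assumptions, the flattened head grows with the segment length, the pooled head has a fixed size regardless of segment length, and the time-distributed head keeps the time axis so that a later layer can aggregate over it.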
“…The 3000 x 64-dimensional features are then reduced to binary class posteriors for foreground classification. Embeddings from a speech activity detection model trained on movie data [27] are used for the purposes of transfer learning for foreground detection task. Convolutional neural network models were trained on 0.64 s duration audio segments for a two class speech/non-speech classification problem.…”
Section: Features For Foreground Detection (mentioning)
confidence: 99%
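The statement above mentions training on 0.64 s audio segments. As a rough illustration of what that segmentation implies, the sketch below cuts a waveform into non-overlapping 0.64 s chunks. The 16 kHz sample rate, the non-overlapping layout, the dropping of a trailing partial segment, and the helper name `split_into_segments` are all assumptions for illustration and are not stated in the quoted text.

```python
import numpy as np

def split_into_segments(samples, sample_rate=16000, segment_seconds=0.64):
    """Cut a 1-D waveform into non-overlapping fixed-length segments.

    Assumed behavior: any trailing audio shorter than one segment is dropped.
    """
    seg_len = int(round(sample_rate * segment_seconds))  # samples per segment
    n_segments = len(samples) // seg_len
    return samples[: n_segments * seg_len].reshape(n_segments, seg_len)

# Three seconds of silence at the assumed 16 kHz sample rate.
audio = np.zeros(16000 * 3)
segments = split_into_segments(audio)
print(segments.shape)  # (4, 10240)
```

At the assumed sample rate, each segment holds 10,240 samples, so a 3 s clip yields four full segments and discards the remaining 0.44 s.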
“…2 but with only two view-branches. [23] and speech activity detection [24]. The CNN of the view-branches in our model is a smaller version of that in [24], and is shown in Fig.…”
Section: I-vector (mentioning)
confidence: 99%
“…Apart from feature-based [1,2,3] and statistical modeling approaches [4,5], recent research effort has been devoted to finding efficient deep-learning-based VAD model architectures. Notable examples include Recurrent Neural Networks (RNN) [6,7,8], Convolutional Neural Networks (CNN) [9,10,11,12], and Convolutional Long Short-Term Memory (LSTM) Deep Neural Networks (CLDNN) [13], which conduct frequency modeling with CNN and temporal modeling with LSTM. LSTM is a popular choice for sequential modeling of VAD tasks [13,6].…”
Section: Introduction (mentioning)
confidence: 99%
“…They also demonstrated that CNNs were useful acoustic models in novel channel scenarios and able to adapt well with limited amounts of data. Hebbar et al [11] compared different LSTM and CNN models with more challenging movie data which contained post-production stage and atypical speech such as electronically modified speech samples. They proposed a Convolutional Neural Network-Time Distributed (CNN-TD) model (with 740K parameters) that outperformed existing models including Bi-LSTM (with 300K parameters), CLDNN (with 1M parameters) and ResNet 960 (with 30M parameters) [9] on the benchmark evaluation dataset AVA-speech.…”
Section: Introduction (mentioning)
confidence: 99%