2016 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2016.7727633

Deep neural networks for Multi-Room Voice Activity Detection: Advancements and comparative evaluation

Cited by 17 publications (11 citation statements)
References 24 publications
“…In all cases, HMM-based Viterbi decoding and "w-sum" decision fusion are used, where the combined log-likelihoods result from microphone-specific GMMs (left) or a GMM trained on a single microphone (right). In comparison, our proposed algorithm exhibits SAD errors of 5.7% and 4.7%, when operating over entire segments ("seg") or sliding windows ("win"), respectively. The latter represents a 19% relative SAD error reduction over the DNN of [30] and 10% over the 3D-CNN of [32], proving better than segment-based operation in the challenging and noisy DIRHA-sim-evalita data (as also observed in Table 6 for DIRHA-sim). These comparisons highlight the competitiveness of our two-stage system and the suitability of the five room discriminant features of its second stage.…”
Section: Comparison To Deep Learning Approaches (supporting)
confidence: 54%
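
The comparison in this statement relies on HMM-based Viterbi decoding followed by "w-sum" decision fusion of microphone-specific GMM log-likelihoods. As an illustration of the fusion step only, here is a minimal sketch assuming a per-frame weighted sum of log-likelihood-ratio scores; the function name, array shapes, uniform weights, and decision threshold are assumptions, since the excerpt does not specify them.

```python
import numpy as np

def wsum_fusion(loglik_per_mic, weights=None):
    """Weighted-sum ("w-sum") fusion of per-microphone log-likelihood scores.

    loglik_per_mic: array of shape (num_mics, num_frames), each row holding a
        microphone-specific speech/non-speech log-likelihood-ratio per frame.
    weights: optional per-microphone weights; uniform weights are assumed
        when none are given (the excerpt does not specify them).
    """
    loglik_per_mic = np.asarray(loglik_per_mic)
    if weights is None:
        weights = np.full(loglik_per_mic.shape[0], 1.0)  # assumption: uniform weights
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ loglik_per_mic  # fused per-frame score, shape (num_frames,)

# Hypothetical usage: 6 microphones, 500 frames of scores
scores = np.random.randn(6, 500)
fused = wsum_fusion(scores)
speech = fused > 0.0  # mark frames as speech where the fused score is positive
```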
“…Specifically, in [29], a DNN is employed taking as input 176-dimensional vectors composed of a variety of features, such as MFCCs, RASTA-PLPs, envelope variance, and pitch. Similar features (but 187-dimensional) and DNNs are again considered in [30], as well as alternative classifiers, including a 2D-CNN. The latter is extended to a multi-channel 3D-CNN system in [31], where log-Mel filterbank energies (40-dimensional) are employed as features, temporal context is exploited by concatenating adjacent time frames, and the resulting 2D single-microphone feature matrices are stacked across channels.…”
Section: Related Work (mentioning)
confidence: 99%
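
The excerpt above describes how the multi-channel 3D-CNN input of [31] is formed: 40-dimensional log-Mel filterbank energies per frame, temporal context from concatenating adjacent frames, and the resulting single-microphone 2D matrices stacked across channels. The sketch below shows one way such an input block could be assembled; the function name, the symmetric context size, and the microphone count are illustrative assumptions, not details taken from [31].

```python
import numpy as np

def build_3dcnn_input(logmel_per_mic, center, context):
    """Assemble a (num_mics, 2*context + 1, num_mels) input block for one decision.

    logmel_per_mic: per-microphone log-Mel matrices, each of shape
        (num_frames, num_mels), e.g. num_mels = 40 as in the excerpt.
    center: index of the frame being classified.
    context: number of adjacent frames kept on each side (assumed value).
    """
    blocks = []
    for feats in logmel_per_mic:
        # 2D single-microphone matrix: the centre frame plus its temporal context
        blocks.append(feats[center - context : center + context + 1, :])
    # Stack the per-microphone 2D matrices along a new channel axis
    return np.stack(blocks, axis=0)

# Hypothetical usage: 6 microphones, 1000 frames, 40 log-Mel bands, +/-5 frames of context
mics = [np.random.randn(1000, 40) for _ in range(6)]
x = build_3dcnn_input(mics, center=100, context=5)
print(x.shape)  # (6, 11, 40)
```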