A better understanding of electroencephalography (EEG) and electrocorticography (ECoG) signals would bring us closer to comprehending brain function, opening new avenues for treating brain disorders and developing novel Brain-Computer Interface (BCI) applications. Deep Neural Networks (DNNs) have lately been employed with remarkable success to decode EEG/ECoG signals for BCI. However, the choice of optimal architectural/training parameter values for these DNN architectures has yet to receive much attention. Meanwhile, new data-driven optimization methodologies that leverage significant advances in machine learning, such as the Transformer model, have recently been proposed. Because an exhaustive search over all possible architectural/training parameter values of the state-of-the-art DNN model (our baseline model) decoding the motor imagery EEG and finger flexion ECoG signals of the BCI IV 2a and 4 datasets, respectively, would require a prohibitive amount of time, this paper proposes an offline model-based optimization technique based on the Transformer model for discovering the optimal architectural/training parameter values for that model. Our findings indicate that we could pick better values for the baseline model's architectural/training parameters, improving its performance by up to 14.7% on the BCI IV 2a dataset and by up to 61.0% on the BCI IV 4 dataset.
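To make the idea concrete, here is a minimal PyTorch sketch of what such a Transformer-based offline optimizer could look like: a surrogate model is fitted to previously logged (configuration, score) pairs and then used to rank unseen candidate configurations. All class names, dimensions, and training details below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: a Transformer surrogate that predicts a DNN's
# validation score from its architectural/training parameter values,
# trained offline on an evaluation history. Names and sizes are assumed.
import torch
import torch.nn as nn

class ConfigSurrogate(nn.Module):
    def __init__(self, n_params: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Each scalar hyperparameter is embedded as one "token".
        self.embed = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # predicted performance score

    def forward(self, configs: torch.Tensor) -> torch.Tensor:
        # configs: (batch, n_params) of normalized hyperparameter values
        tokens = self.embed(configs.unsqueeze(-1))   # (batch, n_params, d_model)
        encoded = self.encoder(tokens).mean(dim=1)   # pool over parameter tokens
        return self.head(encoded).squeeze(-1)        # (batch,) predicted scores

# Offline loop: fit the surrogate on logged (config, score) pairs,
# then rank cheap-to-generate candidate configurations by predicted score.
surrogate = ConfigSurrogate(n_params=6)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
logged_configs = torch.rand(128, 6)   # placeholder evaluation history
logged_scores = torch.rand(128)       # placeholder accuracies
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(surrogate(logged_configs), logged_scores)
    loss.backward()
    opt.step()

surrogate.eval()
candidates = torch.rand(1000, 6)      # unseen candidate configurations
with torch.no_grad():
    best = candidates[surrogate(candidates).argmax()]
```

The appeal of this offline setup is that the expensive EEG/ECoG training runs are only needed to build the logged history; ranking thousands of candidate configurations afterwards costs a single forward pass each.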
Speaker diarization is the task of identifying "who spoke when". Nowadays, speakers' audio clips are usually accompanied by visual information, and recent work has substantially improved speaker diarization performance by exploiting the visual information synchronized with the audio in Audio-Visual (AV) content. This paper presents a deep learning architecture for an AV speaker diarization system with an emphasis on Voice Activity Detection (VAD). Traditional AV speaker diarization systems use hand-crafted features, such as Mel-Frequency Cepstral Coefficients (MFCCs), to perform VAD. In contrast, the VAD module in our proposed system employs Convolutional Neural Networks (CNNs) to learn and extract features directly from the raw audio waveforms. Experimental results on the AMI Meeting Corpus indicate that the proposed multimodal speaker diarization system achieves a state-of-the-art VAD false alarm rate thanks to the CNN-based VAD, which in turn boosts the whole system's performance.
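As a rough illustration of such a waveform-level VAD front end, the sketch below shows a small 1D CNN that classifies fixed-length raw audio frames as speech or non-speech. The layer sizes, kernel widths, and the 16 kHz frame length are assumptions for demonstration, not the architecture described in the paper.

```python
# Illustrative sketch (not the authors' exact network) of a CNN-based VAD
# operating on raw waveforms instead of hand-crafted features like MFCCs.
import torch
import torch.nn as nn

class WaveformVAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # A wide first kernel acts as a learned filterbank over the
            # waveform, taking the role of hand-crafted spectral features.
            nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, 2)  # speech vs. non-speech

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples), e.g. 0.5 s frames at 16 kHz = 8000 samples
        return self.classifier(self.features(wav).squeeze(-1))

vad = WaveformVAD()
frames = torch.randn(4, 1, 8000)   # dummy batch of raw audio frames
speech_logits = vad(frames)        # (4, 2) per-frame speech/non-speech scores
```

In a full diarization pipeline, per-frame decisions like these would gate which segments are passed on to speaker embedding and clustering, which is why a lower VAD false alarm rate lifts the performance of the whole system.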
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.