ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746991
ASD-Transformer: Efficient Active Speaker Detection Using Self and Multimodal Transformers

Cited by 9 publications (8 citation statements) · References 9 publications
“…On the other hand, Tao et al. [36] achieve superior performance by using cross-attention and self-attention modules to aggregate audio and visual features. Building on this work [36], Wuerkaixi et al. [43] and Datta et al. [9] improve performance by introducing positional encoding and refining the attention module. To better exploit the potential of the attention module, Xiong et al. [44] introduce a multi-modal layer normalization that alleviates the distribution misalignment of audio-visual features.…”
Section: Related Work
Confidence: 95%
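The cross-attention described in the statement above can be sketched as scaled dot-product attention where queries come from one modality and keys/values from the other. This is a simplified numpy illustration: it omits the learned projection matrices, multi-head structure, and the self-attention stage that the cited models use, and the feature shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention: each query frame attends
    over all key/value frames from the other modality."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (T_q, T_kv)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values                    # (T_q, d)

# Toy audio/visual feature sequences (hypothetical shapes)
rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 64))    # 50 audio frames, 64-dim
visual = rng.standard_normal((25, 64))   # 25 video frames, 64-dim
audio_attended = cross_attention(audio, visual)  # (50, 64)
```

Because the output keeps the query sequence length, audio frames can be enriched with visual context (and vice versa) before the fused features are aggregated.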
“…Mel-frequency cepstral coefficients (MFCCs) are among the most widely used features in audio recognition, aiming to improve the accuracy of speech activity detection [26]. Therefore, like most existing active speaker detection methods [9, 36, 37, 43, 44, 46], we extract a 2-dimensional feature map composed of 13-dimensional MFCCs and temporal information from the original audio signal as the input of the audio feature encoder. However, we do not follow the general idea of previous studies, which use 2D convolutional neural networks to extract audio features.…”
Section: Audio Feature Encoder
Confidence: 99%
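A minimal numpy/scipy sketch of the kind of time × 13 MFCC feature map described in the statement above. The frame length, hop size, filter count, and pre-emphasis coefficient here are common defaults, not values taken from the cited paper.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_mfcc=13):
    """Return a (num_frames, n_mfcc) MFCC feature map for a mono signal."""
    # Pre-emphasis boosts high-frequency content
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping, Hamming-windowed frames
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)
    # Power spectrum of each frame
    pow_spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel-scale filterbank
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    # Log filterbank energies, then DCT; keep the first n_mfcc coefficients
    log_fb = np.log(pow_spec @ fbank.T + 1e-10)
    return dct(log_fb, type=2, axis=1, norm='ortho')[:, :n_mfcc]

# One second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2 * np.pi * 440 * t))  # shape (98, 13)
```

Stacking the 13 coefficients per frame along the time axis yields exactly the kind of 2D (time × cepstral-coefficient) map that an audio feature encoder can consume.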