2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00033
MAAS: Multi-modal Assignation for Active Speaker Detection

Cited by 28 publications (27 citation statements)
References 36 publications
“…The Unified Context Network (Unicon) [74] proposes relational context modules to capture visual (spatial) and audiovisual context based on convolutional layers. MAAS-TAN [71] proposes a different multimodal graph approach. Following MAAS-TAN, SPELL [73] presents a model that achieved superior performance by proposing an efficient graph-based framework.…”
Section: Related Work
confidence: 99%
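To make the graph formulation in these statements concrete, below is a minimal sketch (assuming PyTorch with PyTorch Geometric) of an audio-visual graph with cross-modal and temporal edges. The node features, edge layout, and single-layer classifier are illustrative assumptions, not the architecture of MAAS-TAN or SPELL.

```python
# Illustrative sketch only: a toy audio-visual graph in the spirit of the
# multimodal graph approaches cited above (MAAS-TAN, SPELL). Feature sizes,
# edge layout, and the single GCN layer are assumptions, not the authors'
# implementations.
import torch
from torch_geometric.nn import GCNConv

T, D = 5, 128                          # assumed: 5 time steps, 128-dim features
audio = torch.randn(T, D)              # one audio node per time step
video = torch.randn(T, D)              # one face-crop node per time step
x = torch.cat([audio, video], dim=0)   # nodes 0..T-1 audio, T..2T-1 video

edges = []
for t in range(T):
    edges += [(t, T + t), (T + t, t)]  # cross-modal link at each time step
    if t + 1 < T:                      # temporal links within each modality
        edges += [(t, t + 1), (t + 1, t),
                  (T + t, T + t + 1), (T + t + 1, T + t)]
edge_index = torch.tensor(edges, dtype=torch.long).t()

conv = GCNConv(D, 2)                   # 2-way score: active vs. inactive
logits = conv(x, edge_index)           # one round of message passing
print(logits.shape)                    # torch.Size([10, 2])
```

The point of such a layout is that a node's speaker score can depend on the other modality at the same instant and on its own modality's neighbors in time, which is the context the quoted works exploit.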
“…As shown in Table 4, various ASD works employing graph-based, contextual, and three-stage frameworks have recently been introduced to attain state-of-the-art results. MAAS-TAN [71] employed a graph neural network approach. SPELL [73] proposed a learned graph-based representation that significantly improves active speaker detection performance owing to its explicit spatial and temporal structure.…”
Section: Comparison With State-of-the-Art
confidence: 99%
“…Alcázar et al. [1,2] first exploit temporal and relational contextual information from multiple speakers to handle the active speaker detection task. Köpüklü et al. [18] and Min et al. [22] follow this idea, designing structures that better model temporal and relational contexts to improve detection performance.…”
Section: Related Work
confidence: 99%
“…mAP vs. FLOPs, size ∝ parameters. The mAP of different active speaker detection methods [1,2,18,22,36,44] on the benchmark and the FLOPs required to predict one frame containing three candidates. The size of the blobs is proportional to the number of model parameters.…”
Section: Introduction
confidence: 99%
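The caption quoted above compares accuracy against compute; its two axes are typically obtained by counting model parameters and per-frame FLOPs. A minimal sketch in plain PyTorch follows, with a hypothetical toy model standing in for an ASD network.

```python
# Illustrative sketch only: measuring the parameter count (blob size in the
# figure quoted above) and a rough FLOPs estimate for a model's dense layers.
# The toy two-layer model is hypothetical, standing in for an ASD network.
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Parameter count: sum the element count of every learnable tensor.
n_params = sum(p.numel() for p in model.parameters())

# Rough FLOPs for one forward pass: ~2 * in_features * out_features per
# Linear layer (one multiply and one add per weight).
flops = sum(2 * m.in_features * m.out_features
            for m in model.modules() if isinstance(m, nn.Linear))

print(f"{n_params} parameters, ~{flops} FLOPs per frame")
```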