2023
DOI: 10.48550/arxiv.2301.08237
Preprint

LoCoNet: Long-Short Context Network for Active Speaker Detection

Abstract: Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons from audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. These two contexts are complementary to each other and can help infer the active speaker. Motivated by these o…

Cited by 2 publications
(2 citation statements)
References 24 publications
“…2. Without fine-tuning, our method achieves a state-of-the-art average F1 score of 81.1% on the Columbia dataset compared with TalkNet [36] and LoCoNet [42], showing good robustness.…”
Section: Comparison With the State-of-the-art
Confidence: 97%
“…Due to the massive-scale and unconstrained nature of Ego4D, it has proved to be useful for various tasks including action recognition (Liu et al., 2022a; Lange et al., 2023), action detection (Wang et al., 2023a), visual question answering (Bärmann & Waibel, 2022), active speaker detection (Wang et al., 2023d), natural language localisation, natural language queries (Ramakrishnan et al., 2023), gaze estimation (Lai et al., 2022), persuasion modelling for conversational agents (Lai et al., 2023b), audio-visual object localisation (Huang et al., 2023a), hand-object segmentation (Zhang et al., 2022b) and action anticipation (Ragusa et al., 2023a; Pasca et al., 2023; Mascaró et al., 2023). New tasks have also been introduced thanks to the diversity of Ego4D, e.g.…”
Section: General Datasetsmentioning
Confidence: 99%