Proceedings of the 27th ACM International Conference on Multimedia 2019
DOI: 10.1145/3343031.3350590
Audio-Visual Variational Fusion for Multi-Person Tracking with Robots

Abstract: Robust multi-person tracking with robots opens the door to analysing engagement and social signals in real-world environments. Multi-person scenarios are characterised by (i) a time-varying number of people, (ii) intermittent auditory (e.g. speech turns) and visual cues (e.g. a person appearing/disappearing), and (iii) the impact of robot actions on perception. The various sensors (cameras and microphones) available for perception provide a rich flow of information of intermittent and complementary nature. How to jo…

Cited by 4 publications (3 citation statements)
References 10 publications
“…Thus, track initialization (birth) and de-activation (death) are required. Strategies usually rely on the temporal persistence of new observations and tracks, such as [13,46,47]. Within a given time interval, a new track is initialized if consistent un-associated observations appear in a nearby region.…”
Section: Introduction
confidence: 99%
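The persistence-based birth/death strategy quoted above can be sketched as follows. This is a hypothetical minimal illustration, not the implementation from the paper or from [13,46,47]: 1-D positions, a simple distance gate, and the counters `birth_frames`/`death_frames` are all assumptions made for clarity.

```python
class TrackManager:
    """Toy sketch of persistence-based track birth and death:
    a track is born after `birth_frames` consistent un-associated
    observations, and dies after `death_frames` frames without support."""

    def __init__(self, birth_frames=3, death_frames=5, radius=1.0):
        self.birth_frames = birth_frames  # consecutive hits needed to confirm a track
        self.death_frames = death_frames  # misses tolerated before deletion
        self.radius = radius              # association gate (same units as observations)
        self.tracks = []                  # confirmed: {"pos": float, "missed": int}
        self.candidates = []              # tentative: {"pos": float, "hits": int}

    def _associate(self, pool, obs):
        # Nearest entry within the gate, or None.
        gated = [t for t in pool if abs(t["pos"] - obs) <= self.radius]
        return min(gated, key=lambda t: abs(t["pos"] - obs)) if gated else None

    def step(self, observations):
        matched, leftover = set(), []
        for obs in observations:
            t = self._associate(self.tracks, obs)
            if t is not None:
                t["pos"], t["missed"] = obs, 0
                matched.add(id(t))
            else:
                leftover.append(obs)

        # Death: age out confirmed tracks with no recent support.
        for t in self.tracks:
            if id(t) not in matched:
                t["missed"] += 1
        self.tracks = [t for t in self.tracks if t["missed"] < self.death_frames]

        # Birth: grow candidates from consistent un-associated observations
        # appearing in a nearby region over successive frames.
        new_candidates = []
        for obs in leftover:
            c = self._associate(self.candidates, obs)
            if c is not None:
                c["pos"], c["hits"] = obs, c["hits"] + 1
                if c["hits"] >= self.birth_frames:
                    self.tracks.append({"pos": obs, "missed": 0})
                else:
                    new_candidates.append(c)
            else:
                new_candidates.append({"pos": obs, "hits": 1})
        self.candidates = new_candidates
```

Real systems would add motion models and probabilistic association; the point here is only the two counters that implement temporal persistence.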
“…Recent works in the domain of audiovisual speaker localization showed promising results utilizing different data fusion strategies. For example, [5,6] introduce a variational Bayesian approximation to optimally merge acoustic and visual data for combined localization and tracking. A related approach introduced in [7] uses an expectation maximization (EM) algorithm for weighted clustering in the audiovisual observation space.…”
Section: Introduction
confidence: 99%
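The EM weighted-clustering idea mentioned in the excerpt can be illustrated with a toy example. This is an assumption-laden sketch, not the method of [7]: observations are pooled into a single 1-D space, and per-observation weights stand in for modality reliability; the cited work operates in a richer audiovisual observation space.

```python
import math

def weighted_em_1d(observations, weights, n_speakers=2, n_iters=50):
    """Toy weighted EM for a 1-D Gaussian mixture: each observation
    carries a weight (a stand-in for audio/visual reliability) that
    scales its responsibility in both E- and M-steps."""
    lo, hi = min(observations), max(observations)
    # Spread initial means over the data range; unit variances, uniform priors.
    means = [lo + (k + 0.5) * (hi - lo) / n_speakers for k in range(n_speakers)]
    variances = [1.0] * n_speakers
    priors = [1.0 / n_speakers] * n_speakers

    for _ in range(n_iters):
        # E-step: weighted responsibilities under the current mixture.
        resp = []
        for x, w in zip(observations, weights):
            lik = [priors[k] / math.sqrt(2 * math.pi * variances[k])
                   * math.exp(-(x - means[k]) ** 2 / (2 * variances[k]))
                   for k in range(n_speakers)]
            total = sum(lik) or 1e-12
            resp.append([w * l / total for l in lik])

        # M-step: update each cluster from its weighted statistics.
        for k in range(n_speakers):
            nk = sum(r[k] for r in resp) or 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, observations)) / nk
            variances[k] = max(1e-6, sum(r[k] * (x - means[k]) ** 2
                                         for r, x in zip(resp, observations)) / nk)
            priors[k] = nk / sum(weights)
    return sorted(means)
```

Down-weighting one modality's observations (e.g. noisy audio bearings) pulls the cluster centres toward the more reliable modality, which is the intuition behind weighted clustering in the audiovisual observation space.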
“…These advances are fueled by the introduction of new large-scale video datasets such as Moments in Time [42], Kinetics [32], and ActivityNet [28]. The ability to automatically recognize activities and events is an important component in video surveillance [1,43,46], virtual/augmented reality [5,24], and human-robot interaction [3,25,41]. Here, we aim to improve activity recognition by incorporating information that is missed or inherently invisible purely from videos.…”
Section: Introduction
confidence: 99%