Transformer networks have proven extremely powerful for a wide variety of tasks since they were introduced. Computer vision is no exception: transformers have become very popular in the vision community in recent years. Despite this wave, multiple-object tracking (MOT) has so far shown some incompatibility with transformers. We argue that the standard representation, bounding boxes, is not well suited to learning transformers for MOT. Inspired by recent research, we propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets. Methodologically, we propose the use of dense queries in a double-decoder network, to robustly infer the heatmap of targets' centers and associate them through time. TransCenter outperforms the current state of the art in multiple-object tracking, on both MOT17 and MOT20. Our ablation study demonstrates the advantage of the proposed architecture over more naive alternatives. The code will be made publicly available.
Robust multi-person tracking with robots opens the door to analysing engagement and social signals in real-world environments. Multi-person scenarios are characterised by (i) a time-varying number of people, (ii) intermittent auditory (e.g. speech turns) and visual (e.g. a person appearing or disappearing) cues, and (iii) the impact of the robot's actions on perception. The various sensors (cameras and microphones) available for perception provide a rich flow of information that is intermittent and complementary in nature. How to jointly exploit these cues to tackle the multi-person tracking problem with an autonomous system has been an intense line of research for the Perception Team over the past few years. In this demo we present our now mature achievements in the field and demonstrate two robotic systems able to track multiple persons using auditory and visual cues, when they are available. We will bring the two robots and the necessary computing resources with us, as well as the presentation materials required to discuss the models, methods and tools supporting this technology with the attendees.
CCS CONCEPTS • Mathematics of computing → Variational methods; • Information systems → Multimedia and multimodal retrieval; • Computing methodologies → Tracking; Scene understanding; Vision for robotics.