2021
DOI: 10.48550/arxiv.2103.15145
Preprint

TransCenter: Transformers with Dense Representations for Multiple-Object Tracking

Abstract: Transformer networks have proven extremely powerful for a wide variety of tasks since they were introduced. Computer vision is not an exception, as the use of transformers has become very popular in the vision community in recent years. Despite this wave, multiple-object tracking (MOT) exhibits for now some sort of incompatibility with transformers. We argue that the standard representation (bounding boxes) is not adapted to learning transformers for MOT. Inspired by recent research, we propose TransCenter, t…

Cited by 24 publications (33 citation statements)
References 50 publications
“…In order to apply Transformer model, DETR [5] treats object detection as a set prediction problem. Transformers are also adopted for Super resolution in [55], Image Colorization in [21], Tracking in [8,54,58], Pose estimation in [29], etc. Besides, for video understanding, there are also recent approaches seek to resolve this challenge using the Transformer networks.…”
Section: Transformers In Computer Vision
Citation type: mentioning
confidence: 99%
“…The breakthroughs of the Transformer networks [60] in natural language processing (NLP) domain have sparked the interest of the computer vision community in developing vision transformers for different computer vision tasks, such as image classification [10,40], object detection [4,63,6,40], image segmentation [96,54,63,40], object tracking [80,81], pose estimation [42,58], etc. Among them, DPT [54] adopts a U-shape structure and uses ViT [10] as an encoder to perform semantic segmentation and monocular depth estimation.…”
Section: Vision Transformers
Citation type: mentioning
confidence: 99%
“…TBC [11] explicitly accounts for the object counts inferred from density maps and simultaneously solves detection and tracking. TransCenter [12] is a transformer-based architecture, which handles long-term complex dependencies by using an attention mechanism. However, these methods are limited in terms the degree to which speed can be increased without losing accuracy because there is a trade-off between speed and accuracy.…”
Section: Tracking Based On Detection
Citation type: mentioning
confidence: 99%
“…However, human detection and feature extraction take a lot of time; hence a rich computational resource is required for real-time tracking. Some methods tackle this problem by simultaneous human detection and feature extraction with a single deep learning model [7,8,9,10,11,12]. However, there is a limitation on the degree to which speed can be increased without losing accuracy.…”
Section: Introduction
Citation type: mentioning
confidence: 99%