2023
DOI: 10.48550/arxiv.2302.02814
Preprint

MixFormer: End-to-End Tracking with Iterative Mixed Attention

Abstract: Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, in this paper, we present a compact tracking framework, termed as MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target in…
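To illustrate the idea of unifying feature extraction with target information integration, the sketch below shows a minimal mixed-attention block in PyTorch. This is not the authors' implementation: the class name MixedAttention, the token dimensions, and the use of plain joint self-attention over concatenated template and search-region tokens are assumptions made for illustration; the paper's Mixed Attention Module has its own, more elaborate design.

```python
# Minimal sketch of a mixed-attention block (illustrative only; names and
# hyperparameters are assumptions, not the authors' code).
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Joint attention over concatenated template and search-region tokens,
    so feature extraction and target-information integration happen in one step."""
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, template_tokens: torch.Tensor, search_tokens: torch.Tensor):
        # Concatenate the two token sequences: (B, N_t + N_s, C).
        mixed = torch.cat([template_tokens, search_tokens], dim=1)
        # Self-attention over the mixed sequence lets search features attend to
        # the target template (and vice versa) while features are being extracted.
        out, _ = self.attn(mixed, mixed, mixed)
        out = self.norm(mixed + out)
        # Split back into template and search streams.
        n_t = template_tokens.shape[1]
        return out[:, :n_t], out[:, n_t:]

# Usage: mix a 64-token template with a 256-token search region.
template = torch.randn(2, 64, 256)
search = torch.randn(2, 256, 256)
mam = MixedAttention()
template_out, search_out = mam(template, search)
```

Stacking several such blocks would yield a backbone in which target cues are injected at every stage rather than in a separate fusion step, which is the simplification the abstract describes.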

Cited by 1 publication (2 citation statements)
References 63 publications (161 reference statements)
“…Trackers employing transformers can be categorized into two classes: CNN-transformer-based trackers such as STARK [6], AiATrack [25], TransT [18] and TrTr [26], and fully transformer-based trackers such as SwinTrack [27], MixFormer [19], ProContEXT [28] and VideoTrack [29]. State-of-the-art trackers tend to adopt a fully transformer architecture, as it achieves the best performance [30], while the CNN-transformer architecture offers a balance between speed and accuracy [30].…”
Section: Transformer-based Trackers
Mentioning confidence: 99%
“…Additionally, transformers [15], originally applied to natural language processing (NLP), have shown excellent performance in various computer vision tasks such as ViT [16] and DETR [17], and have gained attention in the tracking field. Some transformer-based trackers utilize CNNs as a backbone [18] [6], while others rely solely on attention mechanisms [19] [20] [7]. However, most existing trackers rely primarily on spatial information extracted from the object appearance, neglecting valuable temporal information such as motion models and changes in object appearance.…”
Section: Introduction
Mentioning confidence: 99%