2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01313

Watching You: Global-guided Reciprocal Learning for Video-based Person Re-identification

Cited by 72 publications (14 citation statements)
References 26 publications
“…These state-of-the-art methods were all published within the last three years and employ ResNet50 as their backbone to explore the spatial and temporal information in pedestrian images. They use attribute information [18], [24], attention mechanisms [20], [27], graph convolution [28], [29], [14], 3D convolution [32], [34], relation-guided models [22], [23], [12], Generative Adversarial Networks (GANs) [30], and new network architectures [15], [21], [25], [26], [31], [33], [19], [13], respectively, to generate the feature representation of each pedestrian video. Meanwhile, the proposed PiT employs a transformer-based framework and uses simple average fusion to obtain the multi-direction and multi-scale feature pyramid.…”
Section: B. Comparison With State-of-the-art Methods
Mentioning confidence: 99%
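The PiT statement above comes down to averaging the per-scale features of the pyramid into one video-level representation. A minimal sketch of such simple average fusion, assuming every pyramid level has already been projected to a common embedding dimension (the function and tensor names are hypothetical, not PiT's actual API):

import torch

def average_fuse(pyramid_feats):
    """Fuse a multi-direction / multi-scale feature pyramid by simple averaging.

    pyramid_feats: list of tensors, each of shape (batch, dim), one per
    pyramid level; all levels are assumed to share the same embedding dim.
    """
    stacked = torch.stack(pyramid_feats, dim=0)   # (levels, batch, dim)
    return stacked.mean(dim=0)                    # (batch, dim)

# usage: three hypothetical pyramid levels for a batch of 8 videos
levels = [torch.randn(8, 768) for _ in range(3)]
fused = average_fuse(levels)   # (8, 768) video-level representation

Averaging keeps the fusion step parameter-free; any learned weighting or concatenation scheme would be a different design choice from the one described in the quote.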
“…Therefore, existing methods focus on exploiting both spatial and temporal clues from pedestrian videos. GRL [12] employs video-level features to guide the generation of a correlation map and disentangles the frame-level features into high-correlation and low-correlation features. BiCnet-TKS [13] introduces a bilateral complementary network to mine the divergent body parts of each pedestrian and proposes a temporal kernel selection module to explore temporal relations adaptively.…”
Section: A. Video-based Pedestrian Retrieval
Mentioning confidence: 99%
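As a rough illustration of the correlation-guided disentanglement described above for GRL, the sketch below splits frame features into high- and low-correlation parts using cosine similarity to a video-level feature and a simple mean threshold; the thresholding rule and all names are assumptions for illustration, not the paper's actual formulation:

import torch
import torch.nn.functional as F

def disentangle_by_correlation(frame_feats, video_feat):
    """Split frame-level features into high- / low-correlation parts,
    guided by a video-level feature.

    frame_feats: (T, dim) per-frame features of one tracklet
    video_feat:  (dim,)   global video-level feature, e.g. the temporal mean
    """
    # correlation of each frame with the global feature (cosine similarity)
    corr = F.cosine_similarity(frame_feats, video_feat.unsqueeze(0), dim=1)  # (T,)
    mask = (corr >= corr.mean()).float().unsqueeze(1)   # mean threshold, an assumption
    high_corr = frame_feats * mask           # frames consistent with the global view
    low_corr = frame_feats * (1.0 - mask)    # frames carrying divergent details
    return high_corr, low_corr

# usage: 8 frames with 256-d features; the global feature is the temporal mean
frames = torch.randn(8, 256)
high, low = disentangle_by_correlation(frames, frames.mean(dim=0))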
“…From the other methods, we see that there is no strictly positive correlation between rank-1 accuracy and mAP. For instance, GRL [24] achieves the best rank-1 accuracy, but the mAP of GRL is significantly lower than that of MG-RAFA [48]. In MARS, whose detection boxes and tracking sequences are generated automatically by algorithms, the quality of images and tracklets may vary considerably.…”
Section: Comparison With Other Methods
Mentioning confidence: 99%
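The observation that rank-1 accuracy and mAP need not move together follows from what each metric measures: rank-1 only checks the top match, while mAP accounts for the positions of all true matches in the ranking. A small sketch with a toy, hypothetical ranking makes the divergence concrete:

import numpy as np

def rank1_and_ap(ranked_gallery_labels, query_label):
    """Rank-1 accuracy and average precision for one query given a ranked gallery."""
    matches = np.asarray(ranked_gallery_labels) == query_label
    rank1 = float(matches[0])
    # average precision: mean of the precision values at each true-match position
    cum_hits = np.cumsum(matches)
    precisions = cum_hits[matches] / (np.flatnonzero(matches) + 1)
    ap = precisions.mean() if matches.any() else 0.0
    return rank1, ap

# toy example: the top match is correct (rank-1 = 1.0), but the remaining true
# matches sit deep in the ranking, so AP (and hence mAP) stays noticeably lower
rank1, ap = rank1_and_ap([5, 3, 7, 9, 2, 5, 5], query_label=5)
print(rank1, round(ap, 3))   # 1.0 0.587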
“…In the right figure of Fig. 6, the vanilla Transformer gives high attention weight to the 6th frame. We observe the spatial attention map of this frame (as shown in Fig.…”
Table spilled into the quoted passage (MARS rank-1 / mAP, Duke rank-1 / mAP):
TKP [14]: 84.0 / 73.3, 94.0 / 91.7
STA [11]: 86.2 / 81.2, 96.0 / 95.0
GLTR [19]: 87.0 / 78.5, 96.3 / 93.7
MG-RAFA [48]: 88.8 / 85.9, - / -
STE-NVAN [23]: 88.9 / 81.2, 95.2 / 93.5
AGRL [39]: 89.5 / 81.9, 97.0 / 95.4
NL-AP3D [13]: 90.7 / 85.6, 97.2 / 96.1
STGCN [29]: 90.0 / 83.7, 97.3 / 95.7
GRL [24]: 91.0 / 84.
Section: Comparison On Temporal Attention
Mentioning confidence: 99%
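The temporal-attention behaviour discussed above, where one frame receives most of the attention weight, can be reproduced with a generic attention pooling layer over frame features; the single-linear-layer scoring below is an assumption for illustration, not the cited paper's exact module:

import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Weight per-frame features with learned temporal attention and pool them."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-frame scalar score (an assumption)

    def forward(self, frame_feats):      # frame_feats: (batch, T, dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)   # (batch, T, 1)
        pooled = (weights * frame_feats).sum(dim=1)               # (batch, dim)
        return pooled, weights.squeeze(-1)   # weights show which frame dominates

# usage: inspect which of 8 frames receives the highest attention weight
pool = TemporalAttentionPool(dim=256)
feat, w = pool(torch.randn(4, 8, 256))
print(w.argmax(dim=1))   # index of the most attended frame for each video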