2021
DOI: 10.48550/arxiv.2101.03036
Preprint

Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search

Abstract: Text-based person search aims at retrieving the target person in an image gallery using a descriptive sentence of that person. It is very challenging, since the modality gap makes it harder to extract discriminative features. Moreover, the inter-class variance of both pedestrian images and descriptions is small. Hence, comprehensive information is needed to align visual and textual clues across all scales. Most existing methods merely consider the local alignment between images and texts within a singl…

Cited by 15 publications (32 citation statements)
References 29 publications
“…Zheng et al. [30] propose a Gumbel attention module to alleviate the matching redundancy problem, together with a hierarchical adaptive matching model that learns subtle feature representations at three different granularities. Recently, NAFS, proposed by Gao et al. [5], is designed to extract full-scale image and textual representations with a novel staircase CNN network and a locally constrained BERT model. Besides, a multimodal re-ranking algorithm that compares the visual neighbors of the query to the gallery (RVN) is utilized to further improve the retrieval performance.…”
Section: Text-Based Person Retrieval
Mentioning confidence: 99%
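The re-ranking idea described above (scoring gallery items by their visual similarity to the query's neighbors, then blending with the original text-to-image scores) can be sketched as follows. This is a minimal illustration, not the authors' exact RVN algorithm: the function name, the choice of top-k neighbors, and the linear blending rule are all assumptions.

```python
import numpy as np

def rvn_rerank(gallery_feats, text_sims, k=5, alpha=0.5):
    """Sketch of visual-neighbor re-ranking in the spirit of RVN.

    gallery_feats: [N, D] visual features of gallery images.
    text_sims:     [N] original text-to-gallery similarity scores.
    Hypothetical formulation: blend each gallery item's text score with
    its mean visual similarity to the query's top-k retrieved neighbors.
    """
    # cosine similarity between gallery images (visual modality)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    vis_sims = g @ g.T                                    # [N, N]

    # treat the top-k gallery items by text score as the query's visual neighbors
    neighbors = np.argsort(-text_sims)[:k]

    # score each gallery item by its mean visual similarity to those neighbors
    neighbor_score = vis_sims[:, neighbors].mean(axis=1)  # [N]

    # blend the original cross-modal score with the visual-neighbor score
    return alpha * text_sims + (1 - alpha) * neighbor_score
```

In this sketch, alpha controls how much the final ranking trusts the original cross-modal score versus the visual-neighbor evidence; the paper's actual comparison rule may differ.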
“…By combining Align_II and Align_III with Align_I, SPSM can more completely separate person and surroundings information with the aid of SPFM and PDM. Besides, comparing the third row from the bottom with the last row in Table 1, the top-1, top-5 and top-10 performance increase by 1.12%, 0.71%, 0.61% and 1.43…

Method          Top-1   Top-5   Top-10
[23]            13.66   -       41.72
GNA-RNN [12]    19.05   -       53.64
IATV [11]       25.94   -       60.48
PWM-ATH [3]     27.14   49.45   61.02
Dual Path [31]  44.40   66.26   75.07
GLA [2]         43.58   66.93   76.26
MIA [18]        53.10   75.00   82.90
A-GANet [14]    53.14   74.03   81.95
GALM [8]        54.12   75.45   82.97
TIMAM [17]      54.51   77.56   84.78
IMG-Net [25]    56.48   76.89   85.01
CMAAM [1]       56.68   77.18   84.86
HGAN [30]       59.00   79.49   86.60
NAFS [5]        59.…    -       -

…ablation experiments are conducted to search for the optimal zeroing rate r. It can be observed that initially the performance of DSSL follows an increasing tendency with the growth of r.…”
Section: Alignment Paradigms
Mentioning confidence: 99%
“…Although the multi-scale alignment provides a supplement to global feature matching, the alignment at each scale is fixed. Gao et al. [33] recognized the need to align visual and textual clues across all scales and proposed cross-scale alignment for text-based person search.…”
Section: B. Text-Based Person Retrieval
Mentioning confidence: 99%
“…Then both the visual features and the semantic features from the two paths are fed into the proposal-text embedding module. Inspired by [33], we propose an RoI-level cross-scale matching scheme that utilizes mixed multi-scale features extracted from person proposals and text descriptions for feature embedding, with the help of the non-local attention mechanism. Besides, the CMPM and CMPC losses are enforced on top of the global features to supervise the proposal-text cross-modal feature embedding process.…”
Section: A. Overview
Mentioning confidence: 99%
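The CMPM loss mentioned above (Cross-Modal Projection Matching) supervises the embedding by matching the distribution of image-to-text projections against the true identity-matching distribution via a KL divergence. A minimal NumPy sketch of the common formulation follows; the exact normalisation and smoothing in this paper's implementation may differ, and the function name is illustrative.

```python
import numpy as np

def cmpm_loss(image_feats, text_feats, labels, eps=1e-8):
    """Sketch of the Cross-Modal Projection Matching (CMPM) loss.

    image_feats, text_feats: [B, D] embeddings from the two modalities.
    labels: [B] identity ids; pairs sharing an id count as positives.
    """
    labels = np.asarray(labels)

    # true matching distribution: uniform over same-identity pairs per row
    match = (labels[:, None] == labels[None, :]).astype(float)
    match = match / match.sum(axis=1, keepdims=True)

    # project image features onto unit-norm text features
    text_norm = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    proj = image_feats @ text_norm.T                  # [B, B] projection scores

    # predicted matching distribution via a row-wise softmax
    e = np.exp(proj - proj.max(axis=1, keepdims=True))
    pred = e / e.sum(axis=1, keepdims=True)

    # KL divergence between predicted and true matching distributions
    kl = (pred * (np.log(pred + eps) - np.log(match + eps))).sum(axis=1)
    return float(kl.mean())
```

CMPC (Cross-Modal Projection Classification) would additionally classify the projected features by identity; it is omitted here for brevity.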