2021
DOI: 10.1109/tmm.2020.2993960
Fine-Grained Visual Categorization by Localizing Object Parts With Single Image

Cited by 20 publications (7 citation statements) · References 51 publications
“…In contrast, Cross-X [45] proposed a cross-layer fusion method based on the information correlation of different convolutional layers, using a one-squeeze multi-excitation (OSME) module to generate multi-attention features and a cross-semantic regularizer to group feature maps and fuse attention-block features with similar semantics. The OSME module, proposed in [46], captures attention feature maps at different positions and is used in networks with multi-branch feature processing. Similarly, considering the diversity of feature-map information across convolutional layers, localizing object parts (LOP) [47] performs spectral clustering on the information of multiple convolutional layers to enhance the features and obtain more accurate and effective object regions.…”
Section: Feature Fusion and Enhancement Methods
mentioning confidence: 99%
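The LOP-style idea quoted above, clustering convolutional channels across layers so each group highlights one candidate object part, can be sketched roughly as follows. This is a hypothetical illustration, not the cited method's implementation: all shapes, the cosine affinity, and the two-way Fiedler-vector split are assumptions.

```python
import numpy as np

# Treat each channel of several convolutional feature maps as a point,
# build an affinity graph over channels, and spectrally partition them
# so each group averages into one candidate part-attention map.
rng = np.random.default_rng(0)
layer_maps = [rng.random((64, 14, 14)), rng.random((128, 14, 14))]  # (C, H, W), illustrative

# Flatten each channel's spatial response; stack channels from all layers.
X = np.concatenate([m.reshape(m.shape[0], -1) for m in layer_maps])  # (192, 196)

# Cosine-similarity affinity between channel responses.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
W = np.clip(Xn @ Xn.T, 0.0, None)
np.fill_diagonal(W, 0.0)

# Symmetric normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(d)
L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# The sign of the Fiedler vector (eigenvector of the second-smallest
# eigenvalue) gives a two-way spectral partition of the channels.
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)

# Average the channels in each group into one part-attention map each.
part_maps = np.stack([X[labels == k].mean(axis=0).reshape(14, 14) for k in (0, 1)])
```

A real pipeline would use more clusters (e.g. k-means on several Laplacian eigenvectors) and actual network activations rather than random arrays; the sketch only shows the spectral-grouping step.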
“…• LOP [57]: localizes the key object parts within each image using only the single image itself, avoiding the influence of part diversity across different images.…”
Section: Methods
mentioning confidence: 99%
“…
Method        | Backbone     | Resolution | Accuracy (%)
MaxEnt [44]   | DenseNet-161 | -          | 84.9
WARN [45]     | WRN          | 224 × 224  | 85.6
DVAN [4]      | VGG-16       | 224 × 224  | 87.1
NTS-NET [18]  | ResNet-50    | 448 × 448  | 87.5
Cross-X [46]  | ResNet-50    | 448 × 448  | 87.7
CIN [51]      | ResNet-101   | 448 × 448  | 88.1
LAFE [48]     | ResNet-101   | 448 × 448  | 88.1
CDL [49]      | ResNet-50    | 448 × 448  | 88.4
AP-CNN [55]   | ResNet-50    | 448 × 448  | 88.4
DB [26]       | ResNet-50    | 448 × 448  | 88.6
LOP [57]      | ResNet-50    | 224 × 224  | 88.9
FDL [20]      | DenseNet-161 | 448 × 448  | 89.1
CSC-Net [47]  | ResNet-50    | 224 × 224  | 89.2
SCAPNet [56]  | ResNet-50    | 224 × 224  | 89.5
BAEM [58]     | DenseNet-161 | 448 × 448  | 89.5
PMG [52]      | ResNet-50    | 550 × 550  | 89.6
GaRD [53]     | ResNet-50    | 448 × 448  | 89.6
SnapMix [54]  | ResNet-101   | 448 × 448  | 89.6
API-NET [27]  | DenseNet-161 | 512 × 512  | 90.0
CPM [16]      | GoogLeNet    | over 800   | 90.

2) Stanford Dog: As can be seen from Table IV, our method shows a larger performance improvement on the Stanford Dog dataset: it is 2.2% higher than the current state-of-the-art method API-NET [27], without using a contrastive learning mechanism or high-resolution input.…”
Section: Methods
mentioning confidence: 99%
“…The image feature learning part aims to learn the image semantic feature ϕ_I. First, following previous works [51, 52], the input training image sample I^tr is preprocessed by standard normalization to obtain I^tr_norm. Second, the standardized training image sample I^tr_norm is fed into five stacked residual blocks followed by two fully connected (FC) layers to obtain the initial image semantic feature.…”
Section: Image Feature Learning
mentioning confidence: 99%
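The feature-learning pipeline quoted above (standardize the image, pass it through stacked residual blocks, then two FC layers to get ϕ_I) can be sketched minimally as below. Every shape, layer size, and weight scale here is an assumption for illustration, not the cited architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(img):
    # Standard per-channel normalization: zero mean, unit variance.
    mean = img.mean(axis=(1, 2), keepdims=True)
    std = img.std(axis=(1, 2), keepdims=True) + 1e-6
    return (img - mean) / std

def residual_block(x, w):
    # Identity shortcut around a linear transform + ReLU.
    return x + np.maximum(w @ x, 0.0)

img = rng.random((3, 8, 8))              # toy 3-channel image (hypothetical size)
x = normalize(img).reshape(-1)           # flatten for the toy blocks
d = x.size                               # 192

# Five stacked residual blocks (square weights keep the shape).
for w in [rng.normal(0, 0.05, (d, d)) for _ in range(5)]:
    x = residual_block(x, w)

# Two FC layers project down to the image semantic feature phi_I.
w1, w2 = rng.normal(0, 0.05, (64, d)), rng.normal(0, 0.05, (32, 64))
phi_I = w2 @ np.maximum(w1 @ x, 0.0)
```

In practice the residual blocks would be convolutional (as in a ResNet) and the weights learned; the sketch only mirrors the normalize → residual stack → two-FC ordering the statement describes.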