Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval

Wei, Xiu-Shen; Luo, Jian-Hao; Wu, Jianxin; Zhou, Zhi‐Hua

doi:10.1109/tip.2017.2688133

Cited by 412 publications

(240 citation statements)

References 37 publications

Supporting

Mentioning

238

Contrasting

Unclassified


“…There exist two kinds of attention approaches: weighting [3,10] and selection [8,28]. Weighting approaches create attention by emphasizing convolutional activations of relevant information or by reducing activation of irrelevant information via multiplying weights.…”

Section: Literature Review (mentioning)

confidence: 99%

“…Selection approaches direct attention to import information by selecting convolutional features; and the process is equivalent to applying a binary weight spatially in the case of using global average pooling or maximum activation pooling as aggregation techniques. For example, Wei et al [28] selected local features on the largest activated connected component of a convolutional layer. Hoang et al [8] select deep convolutional local features via masks (i.e.…”

Section: Literature Review (mentioning)

confidence: 99%


Component-Based Attention for Large-Scale Trademark Retrieval

Tursun, Denman, Sivapalan, et al. 2022

IEEE Trans. Inform. Forensic Secur.


The demand for large-scale trademark retrieval (TR) systems has significantly increased to combat the rise in international trademark infringement. Unfortunately, the ranking accuracy of current approaches using either handcrafted or pre-trained deep convolution neural network (DCNN) features is inadequate for large-scale deployments. We show in this paper that the ranking accuracy of TR systems can be significantly improved by incorporating hard and soft attention mechanisms, which direct attention to critical information such as figurative elements and reduce attention given to distracting and uninformative elements such as text and background. Our proposed approach achieves state-of-the-art results on a challenging large-scale trademark dataset.
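The selection idea quoted in the citation statements above (keep only the local features that fall on the largest activated connected component, which acts as a binary spatial weight before average or max pooling) can be illustrated with a minimal sketch. It is not the cited authors' implementation. The function name `select_and_aggregate`, the nested-list feature layout, the 4-connectivity, and the choice of the map's mean as the threshold are all assumptions made for this example.

```python
from collections import deque


def select_and_aggregate(features):
    """Hypothetical helper. features[c][y][x] holds non-negative activations
    for C channels on an H x W grid. Returns (kept positions, descriptor)."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])

    # Sum over channels to get a single H x W activation map.
    amap = [[sum(features[c][y][x] for c in range(channels))
             for x in range(width)] for y in range(height)]

    # Assumed threshold: the mean of the activation map.
    mean = sum(map(sum, amap)) / (height * width)
    active = [[amap[y][x] > mean for x in range(width)] for y in range(height)]

    # Breadth-first flood fill to find the largest 4-connected active region.
    seen = [[False] * width for _ in range(height)]
    largest = []
    for y0 in range(height):
        for x0 in range(width):
            if not active[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, component = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and active[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(largest):
                largest = component

    # If nothing exceeds the threshold (flat map), fall back to every position.
    if not largest:
        largest = [(y, x) for y in range(height) for x in range(width)]

    # Binary spatial weight: only the kept positions contribute to pooling.
    avg = [sum(features[c][y][x] for y, x in largest) / len(largest)
           for c in range(channels)]
    mx = [max(features[c][y][x] for y, x in largest) for c in range(channels)]
    return largest, avg + mx


# Toy input: a 2 x 2 "object" at the lower left and one isolated strong
# activation at the top right that the selection step should discard.
features = [
    [[0, 0, 0, 5], [4, 4, 0, 0], [4, 4, 0, 0]],
    [[0, 0, 0, 1], [2, 0, 0, 0], [0, 2, 0, 0]],
]
kept, descriptor = select_and_aggregate(features)
```

In this toy input the isolated activation of 5 in the first channel lies outside the largest component, so it does not influence the max-pooled half of the descriptor.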


“…Zheng et al [24] group the convolutional channels to localize object parts in the well constrained spatial configurations. Wei et al [25] use a simple thresholding method to discover object parts and select the largest component to represent the desired foreground object. In contrast, we formulate the discovery procedure for scene recognition, where more complex semantic regions and unconstrained spatial structures exist.…”

Section: B. Discriminative Region Discovery (mentioning)

confidence: 99%

Scene Recognition With Prototype-Agnostic Scene Layout

Chen, Song, Zeng, et al. 2020

IEEE Trans. on Image Process.


Exploiting the spatial structure in scene images is a key research direction for scene recognition. Due to the large intra-class structural diversity, building and modeling flexible structural layout to adapt various image characteristics is a challenge. Existing structural modeling methods in scene recognition either focus on predefined grids or rely on learned prototypes, which all have limited representative ability. In this paper, we propose Prototype-agnostic Scene Layout (PaSL) construction method to build the spatial structure for each image without conforming to any prototype. Our PaSL can flexibly capture the diverse spatial characteristic of scene images and have considerable generalization capability. Given a PaSL, we build Layout Graph Network (LGN) where regions in PaSL are defined as nodes and two kinds of independent relations between regions are encoded as edges. The LGN aims to incorporate two topological structures (formed in spatial and semantic similarity dimensions) into image representations through graph convolution. Extensive experiments show that our approach achieves state-of-the-art results on widely recognized MIT67 and SUN397 datasets without multi-model or multi-scale fusion. Moreover, we also conduct the experiments on one of the largest scale datasets, Places365. The results demonstrate the proposed method can be well generalized and obtains competitive performance.


“…Most fine-grained classification systems employ visual features of images to classify objects using a CNN [25][26][27][28][29], and subordinate classes from various domains such as flowers, birds, dogs, aircrafts, and cars can be successfully recognized using these approaches. The objects are visually similar to each other, and can only be discriminated through subtle details.…”

Section: Fine-grained Classification (mentioning)

confidence: 99%

“…The objects are visually similar to each other, and can only be discriminated through subtle details. Most fine-grained classification systems employ visual features of images to classify objects using a CNN [25][26][27][28][29], and subordinate classes from various domains such as flowers, birds, dogs, aircrafts, and cars can be successfully recognized using these approaches. To improve the classification performance, some approaches employ hierarchical semantic information such as a taxonomic rank [30], the semantic distance of WordNet [31], and text [15,17].…”

Section: Fine-grained Classification (mentioning)

confidence: 99%

Image classification and captioning model considering a CAM‐based disagreement loss

Yoon, Park, et al. 2019

ETRI Journal


Image captioning has received significant interest in recent years, and notable results have been achieved. Most previous approaches have focused on generating visual descriptions from images, whereas a few approaches have exploited visual descriptions for image classification. This study demonstrates that a good performance can be achieved for both description generation and image classification through an end‐to‐end joint learning approach with a loss function, which encourages each task to reach a consensus. When given images and visual descriptions, the proposed model learns a multimodal intermediate embedding, which can represent both the textual and visual characteristics of an object. The performance can be improved for both tasks by sharing the multimodal embedding. Through a novel loss function based on class activation mapping, which localizes the discriminative image region of a model, we achieve a higher score when the captioning and classification model reaches a consensus on the key parts of the object. Using the proposed model, we established a substantially improved performance for each task on the UCSD Birds and Oxford Flowers datasets.

