Recently, proposal-free instance segmentation has received increasing attention due to its concise and efficient pipeline. Generally, proposal-free methods generate instance-agnostic semantic segmentation labels and instance-aware features to group pixels into different object instances. However, previous methods mostly employ separate modules for these two sub-tasks and require multiple passes for inference. We argue that treating these two subtasks separately is suboptimal. In fact, employing multiple separate modules significantly reduces the potential for application. The mutual benefits between the two complementary sub-tasks are also unexplored. To this end, this work proposes a single-shot proposal-free instance segmentation method that requires only one single pass for prediction. Our method is based on a pixel-pair affinity pyramid, which computes the probability that two pixels belong to the same instance in a hierarchical manner. The affinity pyramid can also be jointly learned with the semantic class labeling and achieve mutual benefits. Moreover, incorporating with the learned affinity pyramid, a novel cascaded graph partition module is presented to sequentially generate instances from coarse to fine. Unlike previous time-consuming graph partition methods, this module achieves 5× speedup and 9% relative improvement on Average-Precision (AP). Our approach achieves state-of-the-art results on the challenging Cityscapes dataset.
Video object detection is a challenging task because of the presence of appearance deterioration in certain video frames. One typical solution is to aggregate neighboring features to enhance per-frame appearance features. However, such a method ignores the temporal relations between the aggregated frames, which is critical for improving video recognition accuracy. To handle the appearance deterioration problem, this paper proposes a temporal context enhanced network (TCENet) to exploit temporal context information by temporal aggregation for video object detection. To handle the displacement of the objects in videos, a novel DeformAlign module is proposed to align the spatial features from frame to frame. Instead of adopting a fixed-length window fusion strategy, a temporal stride predictor is proposed to adaptively select video frames for aggregation, which facilitates exploiting variable temporal information and requiring fewer video frames for aggregation to achieve better results. Our TCENet achieves state-of-the-art performance on the ImageNet VID dataset and has a faster runtime. Without bells-and-whistles, our TCENet achieves 80.3% mAP by only aggregating 3 frames.
Although various methods have been proposed for pedestrian attribute recognition, most studies follow the same feature learning mechanism, \ie, learning a shared pedestrian image feature to classify multiple attributes. However, this mechanism leads to low-confidence predictions and non-robustness of the model in the inference stage. In this paper, we investigate why this is the case. We mathematically discover that the central cause is that the optimal shared feature cannot maintain high similarities with multiple classifiers simultaneously in the context of minimizing classification loss. In addition, this feature learning mechanism ignores the spatial and semantic distinctions between different attributes. To address these limitations, we propose a novel disentangled attribute feature learning (DAFL) framework to learn a disentangled feature for each attribute, which exploits the semantic and spatial characteristics of attributes. The framework mainly consists of learnable semantic queries, a cascaded semantic-spatial cross-attention (SSCA) module, and a group attention merging (GAM) module. Specifically, based on learnable semantic queries, the cascaded SSCA module iteratively enhances the spatial localization of attribute-related regions and aggregates region features into multiple disentangled attribute features, used for classification and updating learnable semantic queries. The GAM module splits attributes into groups based on spatial distribution and utilizes reliable group attention to supervise query attention maps. Experiments on PETA, RAPv1, PA100k, and RAPv2 show that the proposed method performs favorably against state-of-the-art methods.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.