Object Guided External Memory Network for Video Object Detection

Deng, Hanming; Yang, Hua; Song, Tao; Zhang, Zongpu; Zhi, Xue; Ma, Ruhui; Robertson, Neil; Guan, Haibing

doi:10.1109/iccv.2019.00678

Cited by 115 publications

(81 citation statements)

References 28 publications


“…Unlike static images, videos have rich temporal information. In order to benefit from the temporal clues in videos, researchers developed several methods to aggregate information locally and globally using two or more frames [46], [47], [48], [49]. Similarly, we extend HoughNet with a new temporal voting module to incorporate temporal information using an additional (auxiliary) frame.…”

Section: Spatio-temporal Voting (mentioning)

confidence: 99%

HoughNet: Integrating Near and Long-Range Evidence for Bottom-Up Object Detection

Samet, Hiçsönmez, Akbaş (2020), Lecture Notes in Computer Science


This paper presents HoughNet, a one-stage, anchor-free, voting-based, bottom-up object detection method. Inspired by the Generalized Hough Transform, HoughNet determines the presence of an object at a certain location by the sum of the votes cast on that location. Votes are collected from both near and long-distance locations based on a log-polar vote field. Thanks to this voting mechanism, HoughNet is able to integrate both near and long-range, class-conditional evidence for visual recognition, thereby generalizing and enhancing current object detection methodology, which typically relies on only local evidence. On the COCO dataset, HoughNet's best model achieves 46.4 AP (and 65.1 AP50), performing on par with the state-of-the-art in bottom-up object detection and outperforming most major one-stage and two-stage methods. We further validate the effectiveness of our proposal in other visual detection tasks, namely, video object detection, instance segmentation, 3D object detection and keypoint detection for human pose estimation, and an additional "labels to photo" image generation task, where the integration of our voting module consistently improves performance in all cases. Code is available at https://github.com/nerminsamet/houghnet.
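
The voting idea in the abstract above can be illustrated with a small sketch. This is not the HoughNet implementation: the function names (`log_polar_region`, `accumulate_votes`), the ring radii, the number of angular sectors, and the per-region weights are all assumptions chosen for illustration. The sketch only shows the general shape of the mechanism, where the score at a location is the sum of votes from other locations, and each vote is weighted by the log-polar region (ring and angular sector) in which the voter lies.

```python
import math


def log_polar_region(dy, dx, ring_edges, num_angles):
    """Map an offset (dy, dx) to a (ring, sector) index.

    ring_edges are outer radii of the rings, assumed to grow roughly
    geometrically (the "log" part). Returns None outside the field.
    """
    dist = math.hypot(dy, dx)
    if dist == 0:
        return (0, 0)  # the centre cell is treated as its own region
    for ring, edge in enumerate(ring_edges, start=1):
        if dist <= edge:
            angle = math.atan2(dy, dx) % (2 * math.pi)
            sector = int(angle / (2 * math.pi) * num_angles) % num_angles
            return (ring, sector)
    return None


def accumulate_votes(evidence, region_weight, ring_edges, num_angles):
    """Sum weighted votes for every target location (brute force).

    evidence: 2D list of per-location scores (hypothetical input).
    region_weight: dict mapping (ring, sector) to a vote weight;
    regions missing from the dict contribute nothing.
    """
    height, width = len(evidence), len(evidence[0])
    score = [[0.0] * width for _ in range(height)]
    for ty in range(height):
        for tx in range(width):
            for sy in range(height):
                for sx in range(width):
                    region = log_polar_region(sy - ty, sx - tx,
                                              ring_edges, num_angles)
                    if region is not None:
                        weight = region_weight.get(region, 0.0)
                        score[ty][tx] += weight * evidence[sy][sx]
    return score


# Illustrative values only: one strong evidence cell in a 5x5 grid,
# with weights that decay by ring.
demo_evidence = [[0.0] * 5 for _ in range(5)]
demo_evidence[2][2] = 1.0
demo_weights = {(0, 0): 1.0}
for sector in range(4):
    demo_weights[(1, sector)] = 0.5
    demo_weights[(2, sector)] = 0.25
demo_score = accumulate_votes(demo_evidence, demo_weights, [1, 2, 4], 4)
```

The brute-force loops keep the sketch readable; a practical detector would more likely express the same accumulation as a convolution-like operation over learned vote maps.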


“…Furthermore, external memories can benefit the long-term information storage, which can be useful for feature aggregation in the video domain [58], [59]. Besides, some techniques integrate detection trackers to exploit temporal information between keyframe processing [60]-[62].…”

Section: B Feature Aggregation Over Time (mentioning)

confidence: 99%

Adaptive Inattentional Framework for Video Object Detection With Reward-Conditional Training

et al. 2020


Recent object detection studies have been focused on video sequences, mostly due to the increasing demand of industrial applications. Although single-image architectures achieve remarkable results in terms of accuracy, they do not take advantage of particular properties of the video sequences and usually require high parallel computational resources, such as desktop GPUs. In this work, an inattentional framework is proposed, where the object context in video frames is dynamically reused in order to reduce the computation overhead. The context features corresponding to keyframes are fused into a synthetic feature map, which is further refined using temporal aggregation with ConvLSTMs. Furthermore, an inattentional policy has been learned to adaptively balance the accuracy and the amount of context reused. The inattentional policy has been learned under the reinforcement learning paradigm, and using our novel reward-conditional training scheme, which allows for policy training over a whole distribution of reward functions and enables the selection of a unique reward function at inference time. Our framework shows outstanding results on platforms with reduced parallelization capabilities, such as CPUs, achieving an average latency reduction up to 2.09x, and obtaining FPS rates similar to their equivalent GPU platform, at the cost of a 1.11x mAP reduction.


“…Full-sequence level feature aggregation is proposed in [42] to generate robust features for video object detection. External memory is used in [44] to store informative temporal features. In [43], speed-accuracy tradeoff for video object detection is studied.…”

Section: B Video Object Detection (mentioning)

confidence: 99%

Video Object Detection With Two-Path Convolutional LSTM Pyramid

Zhang, Kim (2020), IEEE Access


One of the major challenges in video object detection is drastic scale changes of objects due to camera motion. In this paper, we propose a two-path Convolutional Long Short-Term Memory (convLSTM) pyramid network designed to extract and convey multi-scale temporal contextual information in order to handle object scale changes efficiently. The proposed two-path convLSTM pyramid consists of a stack of multi-input convLSTM modules. It is updated in top-down and bottom-up pathways so that the temporal contextual information for small-to-large and large-to-small scale changes is exploited. The proposed multi-input convLSTM module uses two input feature maps of different resolutions to store and exchange temporal contextual information of different scales between neighboring convLSTM modules. The outputs of the proposed convLSTM pyramid network constitute a feature pyramid where each feature map contains multi-scale temporal contextual information from earlier frames. The proposed convLSTM pyramid can be combined with various still-image object detectors to improve the performance of video object detection. Experimental results on ImageNet VID dataset show that the proposed method achieves state-of-the-art performance and can handle scale changes efficiently in video object detection.
