Rethinking on Multi-Stage Networks for Human Pose Estimation

Li, Wenbo; Wang, Zhicheng; Yin, Binyi; Peng, Qixiang; Du, Yuming; Xiao, Tianzi; Yu, Gang; Lu, Hongtao; Wei, Yichen; Sun, Jun

doi:10.48550/arxiv.1901.00148

Cited by 73 publications

(108 citation statements)

References 41 publications


“…Heatmap-based pose estimation. Heatmap-based 2D pose estimation methods [2,3,6,7,14,21,25,27,36] estimate per-pixel likelihoods for each keypoint location, and currently dominate in the field of 2D human pose estimation. A few works [2,25,27] attempt to design powerful backbone networks which can maintain high-resolution feature maps for heatmap supervision.…”

Section: Related Work (mentioning)

confidence: 99%

Poseur: Direct Human Pose Regression with Transformers

Mao, Shen, Tian et al. 2022

Preprint


We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.


“…Recently, multi-stage networks [15,33,108,110] have attained promising results relative to the aforementioned single-stage models on the challenging deblurring and deraining tasks [22,33,108]. These multi-stage frameworks are generally inspired by their success on higher-level problems such as pose estimation [17,47], action segmentation [24,46], and image generation [113,114].…”

Section: Related Work (mentioning)

confidence: 99%

“…We deem full resolution processing [15,67,75] a better approach than a multi-patch hierarchy [81,108,110], since the latter would potentially induce boundary effects across patches. To impose stronger supervision, we apply a multi-scale approach [17,19,47] at each stage to help the network learn. We leverage the supervised attention module [108] to propagate attentive feature maps progressively along the stages.…”

Section: Multi-stage Multi-scale Framework (mentioning)

confidence: 99%

MAXIM: Multi-Axis MLP for Image Processing

Tu, Talebi, Zhang et al. 2022

Preprint


Recent progress on Transformers and multi-layer perceptron (MLP) models provide new network architectural designs for computer vision tasks. Although these models proved to be effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the main bottlenecks for using Transformers and MLPs in image restoration. In this work we present a multi-axis MLP based architecture, called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature mutual conditioning. Both these modules are exclusively based on MLPs, but also benefit from being both global and 'fully-convolutional', two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement while requiring fewer or comparable numbers of parameters and FLOPs than competitive models.


“…Most methods adopt deep convolutional neural network (CNN) as feature encoder owing to its great performance. In terms of the decoder part, existing approaches fall into two broad categories: heatmap-based [2,3,5,6,40,16,17,19,22,29,37,38] and regression-based [34,15,31,20,32,15] methods. The former is adopted in most cases.…”

Section: Introduction (mentioning)

confidence: 99%

Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?

Li, Yang, Liu et al. 2021

Preprint

Self Cite


The 2D heatmap representation has dominated human pose estimation for years due to its high performance. However, heatmap-based approaches have some drawbacks: 1) The performance drops dramatically in the low-resolution images, which are frequently encountered in real-world scenarios. 2) To improve the localization precision, multiple upsample layers may be needed to recover the feature map resolution from low to high, which are computationally expensive. 3) Extra coordinate refinement is usually necessary to reduce the quantization error of downscaled heatmaps. To address these issues, we propose a Simple yet promising Disentangled Representation for keypoint coordinate (SimDR), reformulating human keypoint localization as a task of classification. In detail, we propose to disentangle the representation of horizontal and vertical coordinates for keypoint location, leading to a more efficient scheme without extra upsampling and refinement. Comprehensive experiments conducted over COCO dataset show that the proposed heatmap-free methods outperform heatmap-based counterparts in all tested input resolutions, especially in lower resolutions by a large margin. Code will be made publicly available at https://github.com/leeyegy/SimDR.
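The quantization error that this abstract attributes to downscaled heatmaps can be illustrated with a toy example along one axis. The sketch below is not taken from the SimDR repository: the function names, the input width, the stride, and the splitting factor `k` (bins per input pixel) are all assumptions chosen for illustration, and a one-hot score vector stands in for a real network output.

```python
# Toy comparison of decoding one coordinate from a downscaled heatmap axis
# versus a 1D classification vector with finer bins. All names and numbers
# are illustrative assumptions, not taken from the SimDR code base.

def argmax(values):
    """Index of the largest score."""
    return max(range(len(values)), key=values.__getitem__)

def one_hot_peak(n_bins, peak):
    """Stand-in for a network output: a score vector with a single peak."""
    return [1.0 if i == peak else 0.0 for i in range(n_bins)]

def decode_heatmap_axis(scores, stride):
    """Map a heatmap bin back to input pixels (one bin covers `stride` pixels)."""
    return argmax(scores) * stride

def decode_classification_axis(scores, k):
    """Map a 1D classification bin back to input pixels (k bins per pixel)."""
    return argmax(scores) / k

width = 64      # input width in pixels (assumed)
true_x = 37.5   # ground-truth x coordinate (assumed)
stride = 4      # heatmap downscaling factor (assumed)
k = 2           # hypothetical splitting factor: bins per input pixel

# Each representation places its peak at the bin nearest to the true coordinate.
heat_scores = one_hot_peak(width // stride, round(true_x / stride))
cls_scores = one_hot_peak(width * k, round(true_x * k))

heat_x = decode_heatmap_axis(heat_scores, stride)
cls_x = decode_classification_axis(cls_scores, k)

print(heat_x, abs(heat_x - true_x))
print(cls_x, abs(cls_x - true_x))
```

Under these assumptions the heatmap axis decodes to 36, an error of 1.5 pixels, while the finer 1D vector recovers 37.5 exactly. The same decoding would be applied independently to the vertical coordinate.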
