Human Pose Estimation Using Deep Consensus Voting

Lifshitz, Ita; Fetaya, Ethan; Ullman, Shimon

doi:10.1007/978-3-319-46475-6_16

Cited by 130 publications

(75 citation statements)

References 26 publications


“…RefineNet [70] improves the combination of upsampled representations and the representations of the same resolution copied from the downsample process. Other works include: light upsample process [5], [19], [72], [124], possibly with dilated convolutions used in the backbone [47], [69], [91]; light downsample and heavy upsample processes [115], recombinator networks [40]; improving skip connections with more or complicated convolutional units [48], [89], [143], as well as sending information from low-resolution skip connections to high-resolution skip connections [151] or exchanging information between them [34]; studying the details of the upsample process [120]; combining multi-scale pyramid representations [18], [125]; stacking multiple DeconvNets/U-Nets/Hourglass [31], [122] with dense connections [110].…”

Section: Related Work (mentioning)

confidence: 99%

Deep High-Resolution Representation Learning for Visual Recognition

Wang

Sun

Cheng

et al. 2021

IEEE Trans. Pattern Anal. Mach. Intell.

2,855

1,479


High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet.

1 INTRODUCTION

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, human pose estimation, and so on. The strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations. Most recently-developed classification networks, including AlexNet [59], VGGNet [101], GoogleNet [108], ResNet [39], etc., follow the design rule of LeNet-5 [61]. This is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and lead to a low-resolution representation, which is further processed for classification.

High-resolution representations are needed for position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. The previous state-of-the-art methods adopt the high-resolution recovery process to raise the representation resolution from the low-resolution representation outputted by a classification or classification-like network as depicted in Figure 1 (b), e.g., Hourglass [83], SegNet [3], DeconvNet [85], U-Net [95], SimpleBaseline [124], and encoder-decoder [90]. In addition, dilated convolutions are used to remove some down-sample layers and thus yield medium-resolution representations [15], [144].

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network…
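The two characteristics named in the abstract above (parallel streams at different resolutions, and repeated exchange between them) can be illustrated with a toy sketch. This is not the HRNet implementation: the learned convolutions are replaced here by fixed 2x2 average pooling and nearest-neighbour upsampling, and all function names (`downsample_2x`, `upsample_2x`, `exchange`) are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one single-channel feature map, stored as rows


def downsample_2x(grid: Grid) -> Grid:
    """Halve the resolution by averaging each non-overlapping 2x2 block.
    Assumption: stands in for a learned strided convolution."""
    return [
        [
            (grid[y][x] + grid[y][x + 1] + grid[y + 1][x] + grid[y + 1][x + 1]) / 4.0
            for x in range(0, len(grid[0]), 2)
        ]
        for y in range(0, len(grid), 2)
    ]


def upsample_2x(grid: Grid) -> Grid:
    """Double the resolution by repeating each value in a 2x2 block.
    Assumption: stands in for a learned upsampling path."""
    out: Grid = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two maps of the same size."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def exchange(high: Grid, low: Grid) -> Tuple[Grid, Grid]:
    """One exchange step: each stream keeps its own resolution and
    receives the other stream's map, resampled to match."""
    new_high = add(high, upsample_2x(low))
    new_low = add(low, downsample_2x(high))
    return new_high, new_low


# A 4x4 high-resolution stream running in parallel with a 2x2 low-resolution one.
high = [[1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 2.0, 2.0],
        [0.0, 0.0, 2.0, 2.0]]
low = [[10.0, 20.0],
       [30.0, 40.0]]
new_high, new_low = exchange(high, low)
```

After the call, `new_high` is still 4x4 and `new_low` is still 2x2, which is the point of the design: the high-resolution stream is never collapsed and later recovered, it is only enriched by the coarser stream.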


“…Monocular RGB body pose estimation in 2D has been widely researched, but estimates only the 2D skeletal pose [Bourdev and Malik 2009; Felzenszwalb et al 2010; Felzenszwalb and Huttenlocher 2005; Ferrari et al 2009; Pishchulin et al 2013; Wei et al 2016]. Learning-based discriminative methods, in particular deep learning methods [Lifshitz et al 2016; Newell et al 2016; Tompson et al 2014], represent the current state of the art in 2D pose estimation, with some of these methods demonstrating real-time performance [Cao et al 2016; Wei et al 2016]. Monocular RGB estimation of the 3D skeletal pose is a much harder challenge tackled by relatively fewer methods [Bogo et al 2016; Tekin et al 2016b,c; Zhou et al , 2015b.…”

Section: Introduction (mentioning)

confidence: 99%

VNect

et al. 2017


Fig. 1. We recover the full global 3D skeleton pose in real-time from a single RGB camera, even wireless capture is possible by streaming from a smartphone (left). It enables applications such as controlling a game character, embodied VR, sport motion analysis and reconstruction of community video (right). Community videos (CC BY) courtesy of Real Madrid C.F. [2016] and RUSFENCING-TV [2017].

We present the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. Our method combines a new convolutional neural network (CNN) based pose regressor with kinematic skeleton fitting. Our novel fully-convolutional pose formulation regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames. A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton. This makes our approach the first monocular RGB method usable in real-time applications such as 3D character control; thus far, the only monocular methods for such applications employed specialized RGB-D cameras. Our method's accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods. Our results are qualitatively comparable to, and sometimes better than, results from monocular RGB-D approaches, such as the Kinect. However, we show that our approach is more broadly applicable than RGB-D solutions, i.e., it works for outdoor scenes, community videos, and low quality commodity RGB cameras.


“…Our proposed method (i.e., Ours-weakC-2) used 9040 images in the MPII (i.e., half of the entire images) for the FS set and other images in "LSP+LSPext+MPII" dataset for the WS set. On the other hand, all images and annotations in MPII and "LSP+LSPext+MPII" were used for training in [74,50,76,49,46] (shown in the upper rows in the table) and [73,18] (shown in the lower rows), respectively. For reference, the results of the baseline [18] that used only half of the entire images in the MPII (i.e., Baseline-2 (HALF) in the table) are shown.…”

Section: Discussion (mentioning)

confidence: 99%

“…Unlike deformable part models, recent DCNN-based human pose estimation methods (e.g., [46,47,48,49,50,18,51]) acquire the position of each body joint from its corresponding heatmap. The heatmap of each joint is outputted from a DCNN as shown in Figure 2.…”

Section: DCNN-based Heatmap Models (mentioning)

confidence: 99%
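The statement above says that heatmap-based methods read each joint position off that joint's heatmap. A minimal sketch of that read-out step follows, assuming the simplest rule (take the highest-scoring pixel); the cited methods may refine this, and the names `joint_from_heatmap` and `pose_from_heatmaps` are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows of per-pixel confidence scores for one joint


def joint_from_heatmap(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, score) of the highest-scoring pixel in one heatmap.
    Ties are resolved in favour of the first pixel in row-major order."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


def pose_from_heatmaps(heatmaps: List[Heatmap]) -> List[Tuple[int, int, float]]:
    """Apply the read-out independently to every joint's heatmap."""
    return [joint_from_heatmap(h) for h in heatmaps]


# Two toy 3x3 heatmaps, one per joint (made-up values for illustration).
example = [
    [[0.0, 0.1, 0.0],
     [0.2, 0.9, 0.1],
     [0.0, 0.1, 0.0]],
    [[0.0, 0.0, 0.7],
     [0.0, 0.1, 0.0],
     [0.0, 0.0, 0.0]],
]
pose = pose_from_heatmaps(example)  # [(1, 1, 0.9), (2, 0, 0.7)]
```

Each joint is located independently here, so nothing in this sketch enforces a plausible skeleton; that is a limitation of the simplification, not a claim about any cited method.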

Semi- and weakly-supervised human pose estimation

Ukita

Uematsu

2018

Computer Vision and Image Understanding


For human pose estimation in still images, this paper proposes three semi- and weakly-supervised learning schemes. While recent advances of convolutional neural networks improve human pose estimation using supervised training data, our focus is to explore the semi- and weakly-supervised schemes. Our proposed schemes initially learn conventional model(s) for pose estimation from a small amount of standard training images with human pose annotations. For the first semi-supervised learning scheme, this conventional pose model detects candidate poses in training images with no human annotation. From these candidate poses, only true-positives are selected by a classifier using a pose feature representing the configuration of all body parts. The accuracy of both candidate pose estimation and true-positive pose selection is improved by action labels provided to these images in our second and third learning schemes, which are semi- and weakly-supervised learning. While the first and second learning schemes select only poses that are similar to those in the supervised training data, the third scheme selects more true-positive poses that are significantly different from any supervised poses. This pose selection is achieved by pose clustering using outlier pose detection with Dirichlet process mixtures and the Bayes factor. The proposed schemes are validated with large-scale human pose datasets.
