Large-Scale Datasets for Going Deeper in Image Understanding

Wu, Jiahong; Zheng, He; Zhao, Bo; Li, Yixin; Yan, Baoming; Liang, Rui; Wang, Wenjia; Zhou, Shuchang; Lin, Guosen; Fu, Yanwei; Wang, Yizhou; Wang, Yonggang

doi:10.1109/icme.2019.00256

Cited by 86 publications

(64 citation statements)

References 62 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Compared to the SimpleBaseline [124] with the same input size, our small and big networks receive 1.2 and 1.8 improvements, respectively. With the additional data from AI Challenger [121] for training, our single big network can obtain an AP of 77.0.…”

Section: Methodsmentioning

confidence: 99%

Deep High-Resolution Representation Learning for Visual Recognition

Wang

Sun

Cheng

et al. 2021

IEEE Trans. Pattern Anal. Mach. Intell.

2,836

1,479

View full text Add to dashboard Cite

High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet. ! 1 INTRODUCTION D EEP convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, human pose estimation, and so on. The strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations. Most recently-developed classification networks, including AlexNet [59], VGGNet [101], GoogleNet [108], ResNet [39], etc., follow the design rule of LeNet-5 [61]. This is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and lead to a low-resolution representation, which is further processed for classification.High-resolution representations are needed for positionsensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. The previous state-of-the-art methods adopt the high-resolution recovery process to raise the representation resolution from the low-resolution representation outputted by a classification or classification-like network as depicted in Figure 1 (b), e.g., Hourglass [83], Seg-Net [3], DeconvNet [85], U-Net [95], SimpleBaseline [124], and encoder-decoder [90]. In addition, dilated convolutions are used to remove some down-sample layers and thus yield medium-resolution representations [15], [144].We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a highresolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network • J. Wang is with Microsoft Research,

show abstract

Section: Methodsmentioning

confidence: 99%

Deep High-Resolution Representation Learning for Visual Recognition

Wang

Sun

Cheng

et al. 2021

IEEE Trans. Pattern Anal. Mach. Intell.

2,836

1,479

View full text Add to dashboard Cite

show abstract

“…Compared to the SimpleBaseline [72] with the same input size, our small and big networks receive 1.2 and 1.8 improvements, respectively. With additional data from AI Challenger [70] for training, our single big network can obtain an AP of 77.0.…”

Section: Coco Keypoint Detectionmentioning

confidence: 99%

Deep High-Resolution Representation Learning for Human Pose Estimation

Sun

Xiao

Liu

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

4,127

2,538

View full text Add to dashboard Cite

In this paper, we are interested in the human pose estimation problem with a focus on learning reliable highresolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process.We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the mutliresolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich highresolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models have been publicly available at https://github.com/leoxiaobin/ deep-high-resolution-net.pytorch.

show abstract

“…To evaluate the performance of multi-person pose estimation algorithms, several public benchmarks were established, such as MSCOCO [19], MPII [2] and AI Challenger [30]. In these benchmarks, the images are usually collected from daily life where crowded scenes appear less Figure 1.…”

Section: Introductionmentioning

confidence: 99%

CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark

Wang²,

Zhu

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

460

236

View full text Add to dashboard Cite

Multi-person pose estimation is fundamental to many computer vision tasks and has made significant progress in recent years. However, few previous methods explored the problem of pose estimation in crowded scenes while it remains challenging and inevitable in many scenarios. Moreover, current benchmarks cannot provide an appropriate evaluation for such cases. In this paper, we propose a novel and efficient method to tackle the problem of pose estimation in the crowd and a new dataset to better evaluate algorithms. Our model consists of two key components: joint-candidate single person pose estimation (SPPE) and global maximum joints association. With multipeak prediction for each joint and global association using graph model, our method is robust to inevitable interference in crowded scenes and very efficient in inference. The proposed method surpasses the state-of-the-art methods on CrowdPose dataset by 5.2 mAP and results on MSCOCO dataset demonstrate the generalization ability of our method. Source code and dataset will be made publicly available.

show abstract

Large-Scale Datasets for Going Deeper in Image Understanding

Cited by 86 publications

References 62 publications

Deep High-Resolution Representation Learning for Visual Recognition

Deep High-Resolution Representation Learning for Visual Recognition

Deep High-Resolution Representation Learning for Human Pose Estimation

CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark

Contact Info

Product

Resources

About