Figure 1: The goal of this work is to discover effective and cost-efficient data annotation strategies for the task of learning dense correspondences in the wild (DensePose). We significantly reduce the annotation effort by exploiting (a) sparse subsets of the DensePose labels augmented with cheaper kinds of annotations, such as object masks or keypoints, and (b) temporal information in videos to propagate ground truth and enforce dense spatio-temporal equivariance constraints.
Abstract
DensePose supersedes traditional landmark detectors by densely mapping image pixels to body surface coordinates. This power, however, comes at the cost of greatly increased annotation time, as supervising the model requires manually labeling hundreds of points per pose instance. In this work, we thus seek methods to significantly slim down the DensePose annotations, proposing more efficient data collection strategies. In particular, we demonstrate that if annotations are collected in video frames, their efficacy can be multiplied for free by using motion cues. To explore this idea, we introduce DensePose-Track, a dataset of videos where selected frames are annotated in the traditional DensePose manner. Then, building on geometric properties of the DensePose mapping, we use the video dynamics to propagate ground-truth annotations in time as well as to learn from Siamese equivariance constraints. Having performed an exhaustive empirical evaluation of various data annotation and learning strategies, we demonstrate that doing so can deliver significantly improved pose estimation results over strong baselines. However, despite what is suggested by some recent works, we show that merely synthesizing motion patterns by applying geometric transformations to isolated frames is significantly less effective, and that motion cues help much more when they are extracted from videos.

* James Thewlis and Iasonas Kokkinos were with Facebook AI Research (FAIR) during this work.
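
As a rough illustration of what propagating ground-truth annotations in time can look like, the sketch below moves a sparse set of annotated points from one frame to the next along a dense motion field. It assumes that such a field (for example, optical flow between consecutive frames) is already available. The function name `propagate_annotations`, the array layouts, and the nearest-pixel flow lookup are illustrative assumptions and are not taken from this work.

```python
import numpy as np


def propagate_annotations(points_xy, labels, flow):
    """Move annotated points from frame t to frame t+1 along a flow field.

    Hypothetical helper for illustration only.

    points_xy: (N, 2) array of (x, y) pixel coordinates in frame t.
    labels:    (N, K) array of per-point labels (e.g. surface coordinates);
               they are carried along unchanged.
    flow:      (H, W, 2) array with flow[y, x] = (dx, dy), the displacement
               of pixel (x, y) from frame t to frame t+1 (assumed given).
    """
    points_xy = np.asarray(points_xy, dtype=float)
    labels = np.asarray(labels)
    h, w = flow.shape[:2]

    # Nearest-pixel lookup of the flow at each annotated location.
    cols = np.clip(np.rint(points_xy[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(points_xy[:, 1]).astype(int), 0, h - 1)
    moved = points_xy + flow[rows, cols]

    # Keep only the points that remain inside the image after the move.
    inside = (
        (moved[:, 0] >= 0) & (moved[:, 0] <= w - 1)
        & (moved[:, 1] >= 0) & (moved[:, 1] <= h - 1)
    )
    return moved[inside], labels[inside]
```

In this simplified form, each propagated point keeps its original label, so a single annotated frame yields additional (noisier) supervision on its neighbours. Points that leave the image are discarded rather than clamped to the border.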