We study the problem of photo cropping, which aims to find a cropping window within an input image that preserves its important content as much as possible while remaining aesthetically pleasing. Seeking a deep learning-based solution, we design a neural network with two branches for attention box prediction (ABP) and aesthetics assessment (AA), respectively. Given the input image, the ABP network predicts an attention bounding box as an initial minimum cropping window, around which a set of cropping candidates is generated with little loss of important information. The AA network then selects, from among these candidates, the final cropping window with the best aesthetic quality. The two sub-networks share the same full-image convolutional feature map and are thus computationally efficient. By leveraging attention prediction and aesthetics assessment, the cropping model produces high-quality cropping results even with the limited training data available for photo cropping. Experimental results on benchmark datasets clearly validate the effectiveness of the proposed approach. In addition, our approach runs at 5 fps, outperforming most previous solutions in speed.
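The two-stage pipeline the abstract describes, candidates grown around an attention box and then ranked by an aesthetics score, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the box format `(x, y, w, h)`, the scale set, and the toy scoring lambda (which simply prefers larger windows, standing in for the AA network) are all assumptions.

```python
# Hypothetical sketch of attention-seeded crop candidate generation and
# selection. Boxes are (x, y, w, h) tuples; image_size is (width, height).

def generate_candidates(attention_box, image_size, scales=(1.0, 1.2, 1.5)):
    """Grow the attention box by several scales, keeping it inside the image,
    so every candidate still contains the attended (important) region."""
    x, y, w, h = attention_box
    img_w, img_h = image_size
    cx, cy = x + w / 2, y + h / 2          # keep candidates centered on the box
    candidates = []
    for s in scales:
        nw, nh = min(w * s, img_w), min(h * s, img_h)
        nx = min(max(cx - nw / 2, 0), img_w - nw)   # clip to image bounds
        ny = min(max(cy - nh / 2, 0), img_h - nh)
        candidates.append((nx, ny, nw, nh))
    return candidates

def best_crop(candidates, aesthetics_score):
    """Return the candidate with the highest aesthetics score
    (the score function stands in for the AA network)."""
    return max(candidates, key=aesthetics_score)

# Toy usage: a 100x80 attention box inside a 320x240 image, with a
# placeholder score that favors larger crops.
cands = generate_candidates((40, 30, 100, 80), image_size=(320, 240))
crop = best_crop(cands, lambda b: b[2] * b[3])
```

With the placeholder score, the largest clipped window wins; in the paper's setting the ranking would instead come from the learned aesthetics branch.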
This paper proposes a 3D shape descriptor network, a deep convolutional energy-based model for modeling volumetric shape patterns. Maximum likelihood training of the model follows an "analysis by synthesis" scheme and can be interpreted as a mode-seeking and mode-shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC, such as Langevin dynamics. The model can also be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate that the proposed model can generate realistic 3D shape patterns and is useful for 3D shape analysis.
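The Langevin dynamics sampler the abstract refers to can be illustrated on a toy energy. This sketch is an assumption-laden stand-in: the quadratic energy E(x) = 0.5·||x||² replaces the learned descriptor net, and the step size and chain length are arbitrary; the update rule itself (gradient descent on the energy plus Gaussian noise) is the standard Langevin form.

```python
# Minimal Langevin dynamics sketch on a toy energy E(x) = 0.5 * ||x||^2,
# whose stationary distribution is a standard Gaussian. In the paper this
# energy would be the learned 3D descriptor network.
import numpy as np

def energy_grad(x):
    """Gradient of the toy energy E(x) = 0.5 * ||x||^2."""
    return x

def langevin_sample(x0, step_size=0.1, n_steps=2000, rng=None):
    """x <- x - (s^2 / 2) * dE/dx + s * noise, iterated n_steps times."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = (x - 0.5 * step_size**2 * energy_grad(x)
             + step_size * rng.standard_normal(x.shape))
    return x

# Run 200 chains from a far-off initialization; the samples should drift
# toward the mode of the energy (the origin).
samples = np.stack([langevin_sample(np.full(3, 5.0), rng=np.random.default_rng(i))
                    for i in range(200)])
```

The same loop, with `energy_grad` backed by a network and chains run in voxel space, is the synthesis step of the "analysis by synthesis" training scheme.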
This work proposes to combine neural networks with the compositional hierarchy of human bodies for efficient and complete human parsing. We formulate the approach as a neural information fusion framework. Our model assembles information from three inference processes over the hierarchy: direct inference (directly predicting each part of a human body from image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). The bottom-up and top-down inferences explicitly model the compositional and decompositional relations in human bodies, respectively. In addition, the fusion of multi-source information is conditioned on the inputs, i.e., by estimating and accounting for the confidence of each source. The whole model is end-to-end differentiable and explicitly models information flows and structures. Our approach is extensively evaluated on four popular datasets, outperforming the state of the art in all cases, with a fast processing speed of 23 fps. Our code and results have been released to ease future research in this direction. * Equal contribution. † Corresponding author: Yanwei Pang.
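The confidence-conditioned fusion of the three inference sources can be sketched numerically. Everything here is illustrative: the 4×4 score maps, the fixed confidence vector, and the softmax weighting are assumptions standing in for the learned, input-dependent confidence estimator described above.

```python
# Hypothetical sketch of multi-source fusion: per-pixel part scores from
# direct, bottom-up, and top-down inference are combined with
# softmax-normalized confidence weights.
import numpy as np

def fuse(direct, bottom_up, top_down, confidences):
    """Convex combination of three score maps; weights are a softmax
    over the per-source confidence estimates."""
    w = np.exp(confidences - confidences.max())   # numerically stable softmax
    w = w / w.sum()
    return w[0] * direct + w[1] * bottom_up + w[2] * top_down

rng = np.random.default_rng(0)
maps = [rng.random((4, 4)) for _ in range(3)]     # toy 4x4 score maps
fused = fuse(*maps, confidences=np.array([2.0, 0.5, 0.5]))
```

Because the weights sum to one, the fused map is a per-pixel convex combination of the three sources; in the full model the confidences would be predicted from the input rather than fixed.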