We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks that are both deep and high-resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees in which each leaf node stores a pooled feature representation. This allows us to focus memory allocation and computation on the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks, including 3D object classification, orientation estimation and point cloud labeling.

arXiv:1611.05009v4 [cs.CV] 10 Apr 2017

…naïvely. We illustrate this in Fig. 1 for a 3D classification example. Given the 3D meshes of [48], we voxelize the input at a resolution of 64³ and train a simple 3D convolutional network to minimize a classification loss. We depict the maximum of the responses across all feature maps at different layers of the network. It is easy to observe that high activations occur only near the object boundaries.

Motivated by this observation, we propose OctNet, a 3D convolutional network that exploits this sparsity property. Our OctNet hierarchically partitions the 3D space into a set of unbalanced octrees [32]. Each octree splits the 3D space according to the density of the data. More specifically, we recursively split octree nodes that contain data points in their domain, i.e., 3D points or mesh triangles, stopping at the finest resolution of the tree. Leaf nodes therefore vary in size; e.g., an empty leaf node may comprise up to 8³ = 512 voxels for a tree of depth 3. Each leaf node in the octree stores a pooled summary of all feature activations of the voxels it comprises. The convolutional network operations are defined directly on the structure of these trees.
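To make the partitioning scheme concrete, the following is a minimal sketch (not the authors' implementation; node layout and pooling choice are illustrative assumptions) of building an unbalanced octree over a binary occupancy grid: a node is split only while it contains data, and each leaf stores a pooled summary of the voxels it comprises.

```python
import numpy as np

def build_octree(grid, x0, y0, z0, size, max_depth, depth=0):
    """Recursively partition a binary occupancy grid into an unbalanced
    octree: split a node only if its domain contains data and the
    maximum depth has not been reached; otherwise make it a leaf that
    stores a pooled (here: mean) summary of the voxels it comprises."""
    block = grid[x0:x0 + size, y0:y0 + size, z0:z0 + size]
    if depth == max_depth or not block.any():
        return {'leaf': True, 'size': size, 'feature': float(block.mean())}
    half = size // 2
    children = [build_octree(grid, x0 + dx, y0 + dy, z0 + dz,
                             half, max_depth, depth + 1)
                for dx in (0, half) for dy in (0, half) for dz in (0, half)]
    return {'leaf': False, 'size': size, 'children': children}

def count_leaves(node):
    return 1 if node['leaf'] else sum(count_leaves(c) for c in node['children'])

# An 8^3 grid with a single occupied voxel: the tree refines only along
# the path to the data, yielding 22 leaves instead of 512 dense voxels.
grid = np.zeros((8, 8, 8))
grid[0, 0, 0] = 1.0
tree = build_octree(grid, 0, 0, 0, size=8, max_depth=3)
n_leaves = count_leaves(tree)  # 22
```

The example shows where the memory savings come from: empty regions collapse into a few large leaves, while refinement is spent only where the data lives.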
Our network therefore dynamically focuses computational and memory resources depending on the 3D structure of the input. This leads to a significant reduction in computational and memory requirements, which allows for deep learning at high resolutions. Importantly, we also show how essential network operations (convolution, pooling and unpooling) can be implemented efficiently on this new data structure.

We demonstrate the utility of the proposed OctNet on three different problems involving three-dimensional data: 3D classification, 3D orientation estimation of unknown object instances and semantic segmentation of 3D point clouds. In particular, we show that the proposed OctNet enables significantly higher input resolutions than dense inputs due to its lower memory consumption, while achieving performance identical to that of the equivalent dense network at lower resolutions. At the same time, we obtain significant speed-ups at resolutions of 128³ and above. Using our OctNet, we investigate the impact of high-resolution inputs with respect to accuracy on the three tasks and demonstrate that higher resol...
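One way to see why defining convolution on the tree saves computation (a sketch of the underlying property, not the paper's implementation): because a leaf stores a single pooled value, the feature is constant over its domain, so every valid convolution response inside the leaf interior is identical and need only be computed once. The toy check below verifies this with a naive dense convolution.

```python
import numpy as np

def conv3x3x3(vol, kernel):
    """Naive valid 3D convolution, for illustration only."""
    out = np.zeros(tuple(s - 2 for s in vol.shape))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                out[x, y, z] = np.sum(vol[x:x + 3, y:y + 3, z:z + 3] * kernel)
    return out

# A leaf of size 4^3 holds one pooled value, so its feature map is
# constant inside the leaf domain.
leaf = np.full((4, 4, 4), 0.5)
kernel = np.arange(27, dtype=float).reshape(3, 3, 3) / 27.0

out = conv3x3x3(leaf, kernel)
# Convolving a constant region yields value * sum(kernel) everywhere:
# here 0.5 * 13.0 = 6.5 for all 2^3 valid positions, so an octree-aware
# convolution evaluates it once per leaf interior instead of per voxel.
assert np.allclose(out, 0.5 * kernel.sum())
```

Boundary voxels between leaves of different values still require individual evaluation, which is why the savings grow with leaf size, i.e., with the emptiness of the input.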