Although there is increasing interest in employing depth data in computer vision applications, the spatial resolution of depth maps is still limited compared with that of typical visible-light images. A novel method is proposed to synthetically improve the spatial resolution of a single depth image. It integrates higher-order terms into the Markov random field (MRF) formulation of example-based methods in order to improve the representational power of those methods. Inference is performed by approximately minimising the higher-order multi-label MRF energies. In addition, to improve the efficiency of the inference algorithm, a hierarchical scheme on the number of MRF states is proposed. First, a large number of states is used to obtain an initial labelling by solving the minimisation problem for the first-order energies only. Then, the problem is solved for the higher-order energies with a smaller number of states. Performance comparisons show that the proposed method improves on the results of first-order approaches that are based on a simple four-connected MRF graph structure, both qualitatively and quantitatively.
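The two-stage scheme described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: iterated conditional modes (ICM) stands in for the paper's approximate minimiser, the higher-order term is a hypothetical 2x2 clique penalty on non-planarity, the pruning rule (keep the stage-1 winner plus the cheapest remaining states) is one plausible choice, and all function names and the weights `lam` and `mu` are invented.

```python
def local_energy(depth, y, x, lam, mu):
    """Energy of all pairwise and 2x2-clique terms that touch pixel (y, x)."""
    H, W = len(depth), len(depth[0])
    e = 0.0
    # First-order terms: absolute difference to the four-connected neighbours.
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < H and 0 <= nx < W:
            e += lam * abs(depth[y][x] - depth[ny][nx])
    # Higher-order terms (hypothetical): non-planarity of each 2x2 clique.
    if mu > 0:
        for cy in (y - 1, y):
            for cx in (x - 1, x):
                if 0 <= cy < H - 1 and 0 <= cx < W - 1:
                    e += mu * abs(depth[cy][cx] + depth[cy + 1][cx + 1]
                                  - depth[cy][cx + 1] - depth[cy + 1][cx])
    return e


def icm(cands, unary, labels, lam, mu, sweeps=5):
    """Iterated conditional modes: greedily re-label one pixel at a time."""
    H, W = len(cands), len(cands[0])
    labels = [row[:] for row in labels]
    depth = [[cands[y][x][labels[y][x]] for x in range(W)] for y in range(H)]
    for _ in range(sweeps):
        for y in range(H):
            for x in range(W):
                best_k, best_cost = labels[y][x], None
                for k, d in enumerate(cands[y][x]):
                    depth[y][x] = d
                    cost = unary[y][x][k] + local_energy(depth, y, x, lam, mu)
                    if best_cost is None or cost < best_cost:
                        best_k, best_cost = k, cost
                labels[y][x] = best_k
                depth[y][x] = cands[y][x][best_k]
    return labels, depth


def total_energy(depth, labels, unary, lam, mu):
    """Full energy: unary + four-connected pairwise + 2x2 higher-order terms."""
    H, W = len(depth), len(depth[0])
    e = sum(unary[y][x][labels[y][x]] for y in range(H) for x in range(W))
    for y in range(H):
        for x in range(W):
            if y + 1 < H:
                e += lam * abs(depth[y][x] - depth[y + 1][x])
            if x + 1 < W:
                e += lam * abs(depth[y][x] - depth[y][x + 1])
            if y + 1 < H and x + 1 < W:
                e += mu * abs(depth[y][x] + depth[y + 1][x + 1]
                              - depth[y][x + 1] - depth[y + 1][x])
    return e


def hierarchical_inference(cands, unary, k_small, lam=1.0, mu=1.0):
    """cands[y][x] and unary[y][x] are lists of K candidate depths and costs."""
    H, W = len(cands), len(cands[0])
    # Stage 1: all K states, first-order energy only (mu = 0).
    init = [[min(range(len(unary[y][x])), key=unary[y][x].__getitem__)
             for x in range(W)] for y in range(H)]
    labels1, depth1 = icm(cands, unary, init, lam, 0.0)
    # State reduction: stage-1 winner first, then the cheapest remaining states.
    cands2 = [[None] * W for _ in range(H)]
    unary2 = [[None] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            win = labels1[y][x]
            rest = sorted((k for k in range(len(cands[y][x])) if k != win),
                          key=lambda k: unary[y][x][k])
            keep = [win] + rest[:k_small - 1]
            cands2[y][x] = [cands[y][x][k] for k in keep]
            unary2[y][x] = [unary[y][x][k] for k in keep]
    # Stage 2: fewer states, full higher-order energy, started from stage 1.
    init2 = [[0] * W for _ in range(H)]
    labels2, depth2 = icm(cands2, unary2, init2, lam, mu)
    e1 = total_energy(depth1, init2, unary2, lam, mu)
    e2 = total_energy(depth2, labels2, unary2, lam, mu)
    return depth1, depth2, e1, e2
```

Because stage 2 starts from the stage-1 labelling and ICM only accepts moves that do not raise the energy, the returned `e2` is never larger than `e1` in this sketch. The cost of each stage-2 sweep scales with `k_small` rather than with the full number of states, which is the efficiency argument the hierarchical scheme relies on.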
In vision-based action recognition, spatio-temporal features from different modalities are used to recognize activities. Temporal modeling is a long-standing challenge in action recognition. However, deep learning-based approaches offer only a limited set of methods for modeling motion information, such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNNs). Recently, the success of transformers in modeling long-range dependencies in natural language processing (NLP) tasks has attracted great attention in other domains, including speech, image, and video, as a way to rely entirely on self-attention without using sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research published on this topic within the last few years is astounding. This paper reviews recent progress in deep learning methods for modeling temporal variations. It focuses on action recognition methods that use transformers for temporal modeling, discussing their main features and the modalities they use, and identifying opportunities and challenges for future research.
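As a rough illustration of how self-attention can relate every frame of a clip to every other frame without recurrence or convolution, the following minimal sketch applies single-head scaled dot-product attention to a sequence of per-frame feature vectors. It is an assumption-laden toy, not a method from the reviewed literature: the function names are invented, and the learned query, key, and value projections of a real transformer are omitted, so each frame's raw features serve as its query, key, and value.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """frames: list of T feature vectors, each a list of d floats.

    Returns (outputs, weights): outputs[t] is a weighted mix of all frames,
    and weights[t][s] is how strongly frame t attends to frame s.
    """
    d = len(frames[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for query in frames:
        # Similarity of this frame to every frame in the clip, near or far.
        scores = [scale * sum(a * b for a, b in zip(query, key))
                  for key in frames]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[t] * frames[t][j] for t in range(len(frames)))
                        for j in range(d)])
    return outputs, weights
```

Each output vector depends on all T frames in a single step, which is why attention can capture long-range temporal dependencies that an RNN would have to carry through many sequential updates. The price is a T-by-T weight matrix, so the cost grows quadratically with clip length.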