We propose a new framework using Deep Neural Networks (DNNs) to obtain accurate light source estimation. Color constancy is formulated as a DNN-based regression problem in which the color of the light source is estimated directly. Traditional color constancy algorithms are constrained by a number of imaging assumptions or rely on shallow learning models built on hand-crafted features. In this paper, instead of using traditional feature representations, we exploit deep learning architectures in the form of Convolutional Neural Networks (CNNs). Unlike existing methods, which rely on predefined low-level features, we propose to use CNNs to learn feature hierarchies and thereby achieve robust color constancy. The deep CNN model used in this work consists of eight (hidden) layers. Such a deep model yields multi-scale image features composed of pixels, edges, object parts and object models. The proposed deep learning approach needs large amounts of data with ground truth for training; unfortunately, no such datasets are available for color constancy. Therefore, we propose a different training approach. Firstly, we propose a new training strategy, consisting of three steps, to learn hierarchical features for color constancy. Secondly, we propose a method to generate more training images with ground-truth labels.

The model used in this work consists of eight layers as defined in [3]. The first five layers are convolutional layers; the last three layers are fully connected layers. Combining all the layers, the total number of parameters in this model is very large (around 60M). Therefore, large-scale datasets with ground-truth light source labels would be required to apply this model directly to the color constancy problem. However, such large-scale datasets are not available. To this end, we propose an alternative training procedure consisting of different training steps in the following section. Furthermore, we propose a data augmentation method to generate more training images with ground-truth labels.

The training consists of three steps. In the first step, we train the model on ImageNet to derive features for object description. In this way, a rich and generic feature hierarchy is learnt that captures the complex visual patterns in generic, real-world images. The last layer of the model is replaced by a 1000-dimensional output vector, and the soft-max loss function is used for training. The aim of this first training step is to obtain a pre-trained feature model representing general images. Since the ImageNet dataset contains 1000 object categories, it provides abundant back-propagation information for training. We denote the parameters obtained by training on ImageNet as the Net1 network.

In the second step of training, the aim is to retrain and adjust the parameters of Net1 for the purpose of color constancy. We retrain the (initial) parameters of Net1 using light source estimates obtained by existing color constancy algorithms as labels. Although any other color constancy algorithm, or a combination of algorithms, can be used to generate the labels for the Image...
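To make the two-stage idea concrete, the following is a minimal sketch (not the authors' code), assuming the eight-layer architecture of [3] corresponds to an AlexNet-style network available in torchvision: an ImageNet-pretrained model stands in for Net1, its 1000-way classification layer is replaced by a 3-dimensional regression head for the illuminant color, and the network is retrained on illuminant labels produced by existing color constancy methods. The loss function and hyper-parameters shown are illustrative assumptions.

```python
# Minimal sketch, not the authors' implementation.
import torch
import torch.nn as nn
from torchvision import models

# Step 1 surrogate: ImageNet-pretrained AlexNet-style network ("Net1").
net1 = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Step 2: replace the 1000-way softmax layer with a 3-dimensional regression
# head that predicts the RGB color of the light source.
net1.classifier[6] = nn.Linear(net1.classifier[6].in_features, 3)

criterion = nn.MSELoss()  # assumed regression loss on illuminant labels
optimizer = torch.optim.SGD(net1.parameters(), lr=1e-3, momentum=0.9)

def training_step(images, illuminants):
    """One retraining step on labels produced by existing color constancy methods."""
    optimizer.zero_grad()
    pred = net1(images)                  # (N, 3) estimated illuminants
    loss = criterion(pred, illuminants)  # (N, 3) pseudo ground-truth labels
    loss.backward()
    optimizer.step()
    return loss.item()
```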
We present a novel latent discriminative model for human activity recognition. Unlike approaches that require conditional independence assumptions, our model is flexible in encoding the full connectivity among observations, latent states, and activity states, and can therefore capture a richer class of contextual information in both state-state and observation-state pairs. Although loops are present in the model, we can treat the graphical model as a linear-chain structure in which exact inference is tractable, so the model is efficient in both inference and learning. The parameters of the graphical model are learned with the Structured Support Vector Machine (Structured-SVM). A data-driven approach is used to initialize the latent variables, so no hand labeling of the latent states is required. Experimental results on the CAD-120 benchmark dataset show that our model outperforms the state-of-the-art approach by over 5% in both precision and recall, while being more efficient in computation.
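The key computational point is that exact MAP inference on a linear chain can be done by dynamic programming. The following is a minimal sketch under that assumption only; the `unary` and `pairwise` score arrays are hypothetical stand-ins for scores produced by the learned Structured-SVM weights, not the paper's actual feature functions.

```python
# Minimal sketch: Viterbi-style exact MAP inference on a linear chain.
import numpy as np

def chain_map_inference(unary, pairwise):
    """unary: (T, S) per-segment state scores; pairwise: (S, S) transition scores.
    Returns the highest-scoring state sequence of length T."""
    T, S = unary.shape
    score = np.zeros((T, S))
    backptr = np.zeros((T, S), dtype=int)
    score[0] = unary[0]
    for t in range(1, T):
        trans = score[t - 1][:, None] + pairwise  # (S, S) scores of moving i -> j
        backptr[t] = trans.argmax(axis=0)         # best predecessor for each state j
        score[t] = trans.max(axis=0) + unary[t]
    # Backtrack the optimal sequence from the best final state.
    states = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(backptr[t, states[-1]]))
    return states[::-1]
```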
Abstract: An activity recognition system is a very important component for assistant robots, but training such a system usually requires a large and correctly labeled dataset. Most previous works allow training data to carry only a single activity label per segment, which is overly restrictive because the labels are not always certain. It is therefore desirable to allow multiple labels for ambiguous segments. In this paper, we introduce the method of soft labeling, which allows annotators to assign multiple weighted labels to data segments. This is useful in many situations, e.g. when the labels are uncertain, when some of the labels are missing, or when multiple annotators assign inconsistent labels. We treat the activity recognition task as a sequential labeling problem. Latent variables are embedded to exploit sub-level semantics for better estimation. We propose a novel method for learning model parameters from soft-labeled data in a max-margin framework. The model is evaluated on a challenging dataset (CAD-120), captured by an RGB-D sensor mounted on the robot. To simulate uncertainty in data annotation, we randomly change the labels for transition segments. The results show significant improvement over the state-of-the-art approach.
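As a rough illustration of how soft labels might enter a max-margin objective, the sketch below weights a standard multiclass hinge loss by the annotator-assigned label weights. This is an assumption for illustration only; the abstract does not specify the actual loss, and the segment, class names and weights are made up.

```python
# Minimal sketch, assuming details not given in the abstract: each annotated
# label contributes a hinge loss scaled by its soft weight.
import numpy as np

def soft_label_hinge(scores, soft_labels, margin=1.0):
    """scores: (K,) classifier scores for K activities;
    soft_labels: (K,) nonnegative annotator weights summing to 1."""
    K = len(scores)
    total = 0.0
    for y, w in enumerate(soft_labels):
        if w <= 0:
            continue
        rival = max(scores[yp] for yp in range(K) if yp != y)  # best competing score
        total += w * max(0.0, margin + rival - scores[y])
    return total

# Example: a transition segment annotated 70% "eating", 30% "drinking".
scores = np.array([2.0, 1.5, -0.5])   # eating, drinking, reaching
soft = np.array([0.7, 0.3, 0.0])
print(soft_label_hinge(scores, soft))
```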
In this paper, we investigate and exploit the influence of facial expressions on automatic age estimation. Different from existing approaches, our method jointly learns age and expression by introducing a new graphical model with a latent layer between the age/expression labels and the features. This layer aims to learn the relationship between age and expression and to capture the facial changes that induce aging and expression appearance, thus obtaining expression-invariant age estimation. Our experiments, conducted on three age-expression datasets (FACES, Lifespan and NEMO), illustrate the improvement in performance when age is learnt jointly with expression compared to expression-independent age estimation. The age estimation error is reduced by 14.43, 37.75 and 9.30 percent for the FACES, Lifespan and NEMO datasets, respectively. The results obtained by our graphical model, without prior knowledge of the expressions of the tested faces, are better than the best reported ones for all datasets. The flexibility of the proposed model to include more cues is explored by incorporating gender together with age and expression. The results show performance improvements for all cues.