Facial recognition under uncontrolled acquisition conditions faces major challenges that limit the deployment of real-life systems. 2.5D information can improve the discriminative power of such systems in conditions where RGB information alone would fail. In this paper we propose a multimodal extension of a previous work: SIFT descriptors extracted from RGB images are integrated with LBP information obtained from depth scans, within a hierarchical framework motivated by principles of human cognition. The framework was evaluated on the EURECOM dataset, where the inclusion of depth information significantly improved the results under all tested conditions compared to the independent unimodal approaches.
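To make the depth cue concrete, the following is a minimal, illustrative sketch of a local binary pattern (LBP) encoding applied to a depth map. The plain 8-neighbour variant, the function name, and the NumPy implementation are our assumptions for illustration, not the paper's actual pipeline (which may use a different LBP variant and parameters).

```python
import numpy as np

def lbp_8(depth):
    """Basic 8-neighbour local binary pattern on a 2D depth map.

    Each interior pixel is compared with its 8 neighbours; every
    neighbour whose value is >= the centre contributes one bit,
    yielding an 8-bit code in [0, 255] per pixel.
    """
    d = np.asarray(depth, dtype=float)
    c = d[1:-1, 1:-1]  # interior (centre) pixels
    # clockwise neighbour offsets starting at the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = d[1 + dy:d.shape[0] - 1 + dy, 1 + dx:d.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(np.uint8) << bit
    return codes

# The normalised histogram of the codes acts as a texture descriptor
# for the depth scan (here on synthetic data for illustration).
depth = np.random.rand(64, 64)
hist, _ = np.histogram(lbp_8(depth), bins=256, range=(0, 256), density=True)
```

In practice such histograms are typically computed per spatial cell and concatenated, and would then be fused with the SIFT-based RGB descriptors inside the hierarchical framework.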