Abstract-Recognizing facial images captured under visible light has been studied for decades. However, many factors, e.g., illumination and pose variations, hinder its successful application in the real world. Recent work has concentrated on other spectra, e.g., near infrared, which can be perceived only by specifically designed devices, to avoid the illumination problem. This inevitably introduces a new problem, namely, cross-modality classification: images registered in the system are in one modality, while the test images captured at recognition time are in another. In addition, there can be many within-modality variations, such as pose and expression, which make the problem more complicated. To address this problem, we propose a novel framework called hierarchical hyperlingual-words (Hwords). First, we design a novel structure, called generic Hwords, to capture the high-level semantics across different modalities and within each modality in a weakly supervised fashion, meaning that only modality-pair and variation information is needed in training. Second, to improve the discriminative power of Hwords, we propose a novel distance metric based on the hierarchical structure of Hwords. Extensive experiments on multimodality face databases demonstrate the superiority of our method over state-of-the-art methods on face recognition tasks subject to pose and expression variations.