Tongue diagnosis is a non-invasive, painless diagnostic method in which a patient's pathological condition is analyzed by observing images of the tongue, and these properties make it a promising candidate for further development. However, traditional tongue diagnosis relies mainly on the experience and judgment of doctors and is easily affected by external factors, which hinders its development and application. Most current studies use conventional machine learning, which is time-consuming and labor-intensive. Other studies use deep learning based on convolutional neural networks (CNNs), but CNNs are not robust to affine transformations and easily lose the spatial relationships between features. In this work, we propose a traditional Chinese medicine (TCM) syndrome classification model for skin diseases based on hierarchical feature fusion of tongue images. By adding a multi-scale residual module to the feature extraction part of a capsule network, we extract richer features from tongue images. An attention mechanism module is embedded in the multi-scale residual module to suppress interference from impurity information in the tongue image, so that accurate features are extracted for classification. Experiments show that the proposed method achieves an accuracy of 89.6\% in tongue-based classification of acne syndrome and 91.6\% for dermatitis syndrome.