Cymbidium goeringii (Rchb. f.) is a traditional Chinese flower highly valued for its biological, cultural, and artistic properties. However, its valuation relies mainly on subjective judgment and lacks a standardized digital evaluation and grading method. Traditional grading methods rely solely on unimodal data and fuzzy grading standards, and the key features that determine value remain largely unexplained. Accurately evaluating C. goeringii quality with multi-modal algorithms and clarifying how key features affect its value are essential for providing a scientific reference for online orchid trading. This study proposes a multi-modal Transformer for C. goeringii quality grading combined with the Shapley Additive Explanations (SHAP) algorithm; the model mainly comprises one embedding layer, one UNet, one Vision Transformer (ViT), and one Encoder layer. A multi-modal orchid dataset of images and text was obtained from Orchid Trading Website, and seven key features were extracted. Using the petal RGB values segmented by the UNet and the global fine-grained features extracted by the ViT, text features and image features were fused by concatenation in the Transformer Encoder, achieving an accuracy of 93.13%. Furthermore, the SHAP algorithm was used to quantify and rank the importance of the seven features, clarifying how the key features affect C. goeringii quality and value. The multi-modal Transformer with SHAP offers a novel way to represent explainable features accurately for C. goeringii grading and shows good potential for establishing a reliable digital evaluation method for high-value agricultural products.
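The feature-ranking step can be illustrated with a minimal sketch of exact Shapley values for seven features. Everything in it is an assumption made for illustration: the feature names (`feature_1` to `feature_7`), the weights, the sample, the all-zero baseline, and `toy_model`, which merely stands in for the trained grading network. It enumerates every feature coalition directly, whereas SHAP implementations typically approximate this computation, so it should not be read as the authors' pipeline.

```python
from itertools import combinations
from math import factorial

# Hypothetical placeholders: the abstract does not name the seven features.
FEATURES = [f"feature_{i}" for i in range(1, 8)]
WEIGHTS = [0.5, 1.5, -2.0, 0.25, 3.0, -0.75, 1.0]   # assumed toy weights
BASELINE = [0.0] * 7                                 # assumed reference input
SAMPLE = [1.0, 2.0, 0.5, 4.0, 1.0, 2.0, 3.0]         # assumed sample to explain


def toy_model(x):
    """Stand-in scoring function: linear terms plus one pairwise interaction."""
    linear = sum(w * v for w, v in zip(WEIGHTS, x))
    return linear + 0.5 * x[0] * x[1]


def coalition_value(subset, sample, baseline):
    """Model output when only the features in `subset` take the sample's values."""
    x = [sample[i] if i in subset else baseline[i] for i in range(len(sample))]
    return toy_model(x)


def shapley_values(sample, baseline):
    """Exact Shapley value of each feature by enumerating all coalitions."""
    n = len(sample)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                gain = (coalition_value(s | {i}, sample, baseline)
                        - coalition_value(s, sample, baseline))
                phi[i] += weight * gain
    return phi


def rank_features(phi):
    """Sort features by absolute Shapley value, most influential first."""
    return sorted(zip(FEATURES, phi), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    for name, value in rank_features(shapley_values(SAMPLE, BASELINE)):
        print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the model output for the sample and for the baseline, which is the additivity property that lets each feature's contribution be read as a share of the predicted score. With seven features the full enumeration covers only 64 coalitions per feature, so the exact computation is cheap at this scale.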