Existing mental health assessment methods rely mainly on expert experience and are therefore prone to subjective bias. To address this, convolutional neural networks are applied to mental health assessment by fusing face, voice, and gait information. The OpenPose algorithm is used to extract facial and posture features, openSMILE is used to extract voice features, and an attention mechanism is introduced to assign appropriate weights to the features of the different modalities. The method identifies and evaluates 10 mental health indicators, including somatization, depression, and anxiety. Simulation results show that the proposed method assesses mental health with an overall recognition accuracy of 77.20% and an F1 value of 0.77. Compared with recognition methods based on the face modality alone, face + voice dual-modal fusion, and face + voice + gait multimodal fusion, the proposed method improves both recognition accuracy and F1 value to varying degrees, which gives it practical application value.
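The attention-based weighting of modal features mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the face, voice, and gait features have already been projected to a common dimension, and the function name `attention_fuse`, the scoring vector `w`, and the random feature values are hypothetical placeholders chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_fuse(features, w, b=0.0):
    """Fuse per-modality feature vectors with attention weights.

    features: array of shape (n_modalities, d), one row per modality
              (assumed here: face, voice, gait, all of dimension d).
    w:        scoring vector of shape (d,); learned in a real model,
              supplied by the caller in this sketch.
    Returns the fused vector of shape (d,) and the modality weights.
    """
    scores = np.tanh(features) @ w + b  # one scalar score per modality
    weights = softmax(scores)           # non-negative, sums to 1
    fused = weights @ features          # weighted sum of the modality rows
    return fused, weights


# Hypothetical 8-dimensional features for the three modalities.
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))  # rows: face, voice, gait
w = rng.normal(size=8)              # stand-in for a learned scoring vector

fused, weights = attention_fuse(features, w)
print("modality weights (face, voice, gait):", weights)
print("fused feature shape:", fused.shape)
```

In a trained network the scoring vector would be learned jointly with the classifier, so that more informative modalities receive larger weights; with an all-zero scoring vector the sketch reduces to a plain average of the three modalities.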