Sensor data from wearable devices have been used to analyze differences between experts and novices. Previous studies attempted to classify the expert-novice level from sensor data using supervised learning methods. However, these approaches require enough training data to cover the varied sensor patterns of novices. In this paper, we propose a semi-supervised anomaly detection approach that requires only the sensor data of experts for training and identifies the sensor data of novices as anomalies. Our anomaly detection model, the conditional multimodal variational autoencoder (CMVAE), makes two technical contributions: (i) it takes into account information about the action a person is performing, and (ii) it uses multimodal sensor data, in this case eye-tracking data and motion data. The proposed method is evaluated on sensor data measured while expert and novice soccer players were shooting, dribbling, and juggling a soccer ball. Experimental results show that CMVAE classifies the expert-novice level more accurately than previous supervised learning methods and than anomaly detection methods based on other VAEs.