Social skills training, performed by human trainers, is a well-established method for obtaining appropriate skills in social interaction. Previous work automated the process of social skills training by developing a dialogue system that teaches social communication skills through interaction with a computer avatar. Even though previous work that simulated social skills training only considered acoustic and linguistic information, human social skills trainers take into account visual and other non-verbal features. In this paper, we create and evaluate a social skills training system that closes this gap by considering the audiovisual features of the smiling ratio and the head pose (yaw and pitch). In addition, the previous system was only tested with graduate students; in this paper, we applied our system to children or young adults with autism spectrum disorders. For our experimental evaluation, we recruited 18 members from the general population and 10 people with autism spectrum disorders and gave them our proposed multimodal system to use. An experienced human social skills trainer rated the social skills of the users. We evaluated the system’s effectiveness by comparing pre- and post-training scores and identified significant improvement in their social skills using our proposed multimodal system. Computer-based social skills training is useful for people who experience social difficulties. Such a system can be used by teachers, therapists, and social skills trainers for rehabilitation and the supplemental use of human-based training anywhere and anytime.