This paper applies computer vision to the skill and emotion assessment of children with Autism Spectrum Disorder (ASD) by extracting bio-behaviors, human activities, child-therapist interactions, and joint pose estimates from video-recorded, single- or two-person, play-based intervention sessions. We amassed a comprehensive dataset of 300 videos of ASD children engaged in social interaction and developed three novel deep learning-based computer vision models: 1) an activity comprehension model that analyzes child-play-partner interactions; 2) an automatic joint attention recognition framework using pose; and 3) an emotion and facial expression recognition model. We tested the models on 68 unseen real-world videos of children, drawn from clinic recordings and public datasets. The activity comprehension model achieves an overall accuracy of 72.32%; the joint attention model achieves 97% accuracy for eye-gaze following and 93.4% for hand pointing; and the facial expression recognition model achieves an overall accuracy of 95.1%. The proposed models can extract activities and behaviors of interest from free-play and intervention-session videos with limited supervision, empowering clinicians with data useful for the diagnosis, assessment, treatment formulation, and monitoring of children with ASD.
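To make the pose-based joint attention cues concrete, the sketch below shows a minimal geometric heuristic for detecting eye-gaze following and hand pointing from 2D pose keypoints. This is an illustrative assumption, not the paper's actual framework: the keypoint names, coordinates, and the 20-degree angular threshold are all hypothetical, and a real pipeline would obtain keypoints from a pose estimator and learn such decisions rather than hard-code them.

```python
# Illustrative sketch only: a geometric heuristic for pose-based joint
# attention cues (gaze following / hand pointing). All keypoints and the
# angular threshold are assumptions, not the paper's implementation.
import numpy as np

def direction(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Unit vector pointing from keypoint a toward keypoint b."""
    v = b - a
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def points_toward(origin: np.ndarray, toward: np.ndarray,
                  target: np.ndarray, max_deg: float = 20.0) -> bool:
    """True if the ray origin->toward is aimed at target within max_deg."""
    ray = direction(origin, toward)
    to_target = direction(origin, target)
    cos_angle = float(np.clip(np.dot(ray, to_target), -1.0, 1.0))
    return np.degrees(np.arccos(cos_angle)) <= max_deg

# Hypothetical 2D keypoints (as produced by an off-the-shelf pose estimator)
ear, nose = np.array([92.0, 79.0]), np.array([100.0, 80.0])
elbow, wrist = np.array([120.0, 110.0]), np.array([150.0, 100.0])
toy = np.array([220.0, 75.0])  # hypothetical shared attention target

gaze_cue = points_toward(ear, nose, toy)     # crude head-orientation proxy
point_cue = points_toward(elbow, wrist, toy) # arm-extension proxy
print(f"gaze toward toy: {gaze_cue}, pointing at toy: {point_cue}")
```

In practice, per-frame cues like these would only be weak features; a learned model over pose sequences, as the paper's joint attention framework implies, is needed to reach the reported accuracies.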