Human activity recognition in videos is a challenging problem that has drawn considerable interest, particularly when the goal requires the analysis of a large video database.

The Advancing Out-of-School Learning in Mathematics and Engineering (AOLME) project provides a collaborative learning environment in which middle school students explore mathematics, computer science, and engineering by processing digital images and videos. As part of this project, around 2200 hours of video data were collected for analysis. The data were collected to understand how children learn in situations involving mathematical and programming challenges, so as to identify best teaching practices that support broadening the participation of underrepresented students in STEM fields. Because of the size of the dataset, analyzing all of the videos manually is impractical. Thus, there is a strong need for reliable computer-based methods that can detect activities of interest.

This thesis focuses on the development of accurate methods for detecting and tracking objects in long videos (> 1 hour) of collaborative learning environments. Long-term object detection and tracking face fundamental challenges due to occlusion, illumination variations, and pose variations.

For collaborative learning groups, the thesis contributes robust methods for computer keyboard detection, keyboard tracking, and student hand detection. For hand detection, the thesis integrates object detection with clustering and time-projections for accurate, long-term assessment of student participation. The hand detection method was integrated into a writing detection system and can also be used in later research on recognizing student gestures.

All the models are validated on videos from 7 different sessions, ranging from 45 to 90 minutes in length. The keyboard detector achieved an average precision (AP) of 92% at an intersection over union (IoU) threshold of 0.5.
Furthermore, a combined system that pairs the detector with the fast KCF tracker (159 fps) was developed so that the algorithm runs significantly faster without sacrificing accuracy. For a 23-minute video with a resolution of 858 × 480 at 30 fps, detection alone runs at 4.7× real-time and the combined algorithm runs at 21× real-time, with average IoUs of 0.84 and 0.82, respectively.

The hand detector achieved an AP of 72% at 0.5 IoU. Using optimal data augmentation parameters improved the AP to 81% at 0.5 IoU, with the detector running at 4.7× real-time. The hand detection method was integrated with projections and clustering for accurate proposal generation, which reduced the number of false-positive hand detections by 80%. The overall hand detection system runs at 4× real-time and captures all the activity regions of the current collaborative group.
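As a minimal sketch of the two quantities the abstract reports, the Python below computes the IoU between two axis-aligned boxes and converts a speed factor relative to real-time into processing time. The box layout `(x_min, y_min, x_max, y_max)` and the function names `iou`, `is_match`, and `processing_seconds` are assumptions made for illustration; they are not taken from the thesis code.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


def processing_seconds(video_minutes: float, speed_factor: float) -> float:
    """Time needed to process a video at `speed_factor` times real-time."""
    return video_minutes * 60.0 / speed_factor


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(processing_seconds(23, 4.7), processing_seconds(23, 21))
```

Under these assumptions, the 23-minute video (1380 seconds) takes roughly 294 seconds to process at 4.7× real-time and roughly 66 seconds at 21× real-time.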
Contents
List of Figures
List of Tables
Glossary
AR Average Recall.