Group activity recognition in sports is often challenging due to the complex dynamics and interactions among the players. In this thesis, we propose a deep architecture to classify puck possession events in ice hockey. Our model consists of three distinct phases: feature extraction, feature aggregation, and learning and inference. For feature extraction and aggregation, we use a Convolutional Neural Network (CNN) followed by a late fusion model to extract and aggregate different types of features, including handcrafted homography features that encode the camera information. The output from the CNN is then passed into a Recurrent Neural Network (RNN) for temporal extension and classification of the events. The proposed model captures context information from the frame features as well as the homography features. The individual attributes of the players and the interactions among them are also incorporated using a pre-trained model and team pooling. Our model requires only the player positions in the image and the homography matrix, and does not need any explicit annotations for individual actions or player trajectories, which greatly simplifies the input to the system. We evaluate our model on a new Ice Hockey Dataset and a Volleyball Dataset. Experimental results show that our model produces promising results on both of these challenging datasets with much simpler inputs than previous work requires.

Lay Summary

Group activity recognition is the task of determining what a group of people is doing given a single image or a short video clip. We study group activity recognition in sports videos, particularly ice hockey. Thus, given a sequence of images, we aim to classify the sequence into a group activity or event. Many events can occur in ice hockey, but we consider only the subset of events that involve possession of the puck by the players. Examples include pass and shot.
We address this problem by proposing a deep network architecture that takes into account player appearance and contextual information. Features from these different sources are fused together and passed into a temporal model to learn the dependencies across the images in the given sequence.

Preface

This thesis is submitted in partial fulfillment of the requirements for a Master of Science degree in Computer Science. All work presented here is the original work of the author, Moumita Roy Tora, performed under the supervision of Professor James J. Little.