Federated learning allows a large number of devices to jointly learn a model without sharing data. In this work, we enable clients with limited computing power to perform action recognition, a computationally intensive task. We first compress the model at the central server through knowledge distillation on a large dataset. This allows the model to learn complex features and serves as an initialization for subsequent fine-tuning. Fine-tuning is necessary because smaller datasets do not contain enough data for action recognition models to learn complex spatio-temporal features. Because the clients are often heterogeneous in their computing resources, we use asynchronous federated optimization, for which we further derive a convergence bound. We compare our approach against two baselines: fine-tuning at the central server (no clients), and fine-tuning with heterogeneous clients using synchronous federated averaging. On a testbed of heterogeneous embedded devices, we empirically show that our method achieves action recognition accuracy comparable to both baselines, while our asynchronous learning strategy reduces training time by 40% relative to synchronous learning.

Impact Statement: To enable edge devices to perform action recognition with limited computing power, we have developed a federated learning framework that accounts for the heterogeneity of resources on each embedded device. CCTV cameras, which are critical to security applications, often require the classification of abnormal activity. Since privacy is crucial, video data cannot always be transferred to a central server; hence, training must occur on the edge devices themselves. To ensure that downtime on some devices does not affect the rest of the system, we use asynchronous updates.
The framework in this paper can be used in a variety of applications, including, but not limited to, industry, defense, and government applications.
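To make the asynchronous server-side update concrete, the following is a minimal sketch of a staleness-weighted asynchronous federated update in the style of FedAsync (Xie et al.). The polynomial staleness decay and the specific hyperparameters (`alpha`, exponent `a`) are illustrative assumptions, not the exact weighting analyzed in this paper.

```python
# Sketch: staleness-weighted asynchronous federated update.
# Unlike synchronous federated averaging, the server mixes in each client
# model as soon as it arrives, down-weighting updates computed against an
# older version of the global model. The polynomial decay below is an
# assumption for illustration; the paper's exact schedule may differ.

def staleness_weight(alpha: float, staleness: int, a: float = 0.5) -> float:
    """Mixing weight that shrinks as the client update grows more stale."""
    return alpha * (staleness + 1) ** (-a)


def async_server_update(global_model, client_model, alpha, staleness):
    """Mix a (possibly stale) client model into the global model in place of
    waiting for all clients, as synchronous averaging would."""
    w = staleness_weight(alpha, staleness)
    return [(1 - w) * g + w * c for g, c in zip(global_model, client_model)]


# A fresh update (staleness 0) moves the global model more than a stale one.
g = [0.0, 0.0]
fresh = async_server_update(g, [1.0, 1.0], alpha=0.6, staleness=0)
stale = async_server_update(g, [1.0, 1.0], alpha=0.6, staleness=9)
```

Because slow devices no longer gate a global synchronization barrier, downtime on one client only delays that client's own contribution, which is the property the impact statement highlights.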