Deep Neural Networks (DNNs) have emerged as a powerful tool for human action recognition, yet their reliance on vast amounts of high-quality labeled data poses significant challenges. A promising alternative is to train these networks on synthetic data. However, existing synthetic data generation pipelines require complex simulation environments. Our solution bypasses this requirement by employing Generative Adversarial Networks (GANs) to generate synthetic data from only a small existing real-world dataset.
Our training pipeline extracts the motion from each training video and augments it across the various subject appearances within the training set. This approach increases diversity in both motion and subject representation, significantly enhancing the model's performance. We present a rigorous evaluation of the model under diverse scenarios, including ground and aerial views, and analyze critical factors that influence human action recognition performance, such as gesture motion diversity and subject appearance.
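To make the augmentation strategy concrete, the following is a minimal Python sketch of the cross-combination loop implied by this pipeline, not the paper's actual implementation. The classes `TrainingVideo`, `PoseEstimator`, and `MotionTransferGAN`, and the function `augment_dataset`, are hypothetical placeholders standing in for whatever pose-extraction and GAN-based motion-transfer components are actually used.

```python
# Sketch: re-render the motion of every source video onto every other
# subject's appearance, producing N_motions x (N_subjects - 1) synthetic clips.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class TrainingVideo:
    frames: np.ndarray   # (T, H, W, 3) uint8 video frames
    action_label: str    # e.g. an action/gesture class name
    subject_id: str      # identity of the person performing the action


class PoseEstimator:
    """Hypothetical pose backbone: maps frames to per-frame skeleton keypoints."""
    def extract_motion(self, frames: np.ndarray) -> np.ndarray:
        # Placeholder: a real model would return (T, num_joints, 2) keypoints.
        return np.zeros((frames.shape[0], 17, 2), dtype=np.float32)


class MotionTransferGAN:
    """Hypothetical generator: renders a motion sequence onto a target appearance."""
    def generate(self, motion: np.ndarray, appearance_frame: np.ndarray) -> np.ndarray:
        # Placeholder: a real generator would synthesize the target subject
        # performing the given motion; here we just repeat the reference frame.
        return np.repeat(appearance_frame[None, ...], motion.shape[0], axis=0)


def augment_dataset(real_videos: List[TrainingVideo],
                    pose_estimator: PoseEstimator,
                    generator: MotionTransferGAN) -> List[TrainingVideo]:
    """Cross-combine every extracted motion with every other subject's appearance."""
    synthetic: List[TrainingVideo] = []
    for source in real_videos:
        motion = pose_estimator.extract_motion(source.frames)
        for target in real_videos:
            if target.subject_id == source.subject_id:
                continue  # keep only new motion/appearance combinations
            appearance = target.frames[0]  # one reference frame of the target subject
            frames = generator.generate(motion, appearance)
            # The synthetic clip inherits the action label of the source motion.
            synthetic.append(TrainingVideo(frames, source.action_label, target.subject_id))
    return synthetic
```

Under these assumptions, a training set of N videos yields on the order of N times (number of distinct subjects minus one) additional synthetic clips, which is the source of the increased motion and appearance diversity described above.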