Most industrial workplaces that use robots and other machinery keep them behind fences to prevent defects, hazards, and casualties. Recent advances in machine learning can enable robots to cooperate with human co-workers while retaining safety, flexibility, and robustness. This article focuses on a computational model that provides a collaborative environment through intuitive and adaptive human–robot interaction (HRI). In essence, one layer of the model can be expressed as a set of useful information used by an intelligent agent. Within this framework, a vision-sensing modality can be broken down into multiple layers. The authors propose a trainable model, based on human skeleton data, that recognizes spatiotemporal human worker activity using LSTM networks and achieves a training accuracy of 91.365% on the InHARD dataset. Alongside the training results, aspects of the simulation environment and future improvements of the system are discussed. Combining the upper-body positions of human workers with their actions increases the perceptual potential of the system and makes human–robot collaboration context-aware. Based on the acquired information, the intelligent agent can adapt its behavior to its dynamic and stochastic surroundings.
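To make the idea of skeleton-based activity recognition with an LSTM more concrete, the following is a minimal sketch of the forward pass only, not the authors' implementation. It assumes each video frame is flattened into a vector of joint coordinates and that the final hidden state is mapped to class probabilities with a softmax. The class name `TinyLSTMClassifier`, the joint count, the class count, the hidden size, and the random untrained weights are all hypothetical placeholders chosen for illustration; they are not taken from the InHARD dataset or from the article.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMClassifier:
    """Hypothetical single-layer LSTM followed by a linear softmax classifier."""

    def __init__(self, n_features, n_hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        # One weight matrix per gate, applied to the concatenation [x_t, h_{t-1}]:
        # i = input gate, f = forget gate, o = output gate, g = candidate cell state.
        self.W = {k: init_matrix(n_hidden, n_features + n_hidden, rng) for k in "ifog"}
        self.b = {k: [0.0] * n_hidden for k in "ifog"}
        self.W_out = init_matrix(n_classes, n_hidden, rng)

    def _gate(self, name, z, activation):
        pre = matvec(self.W[name], z)
        return [activation(p + b) for p, b in zip(pre, self.b[name])]

    def step(self, x, h, c):
        """Advance the LSTM by one frame and return the new (h, c)."""
        z = list(x) + list(h)
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        o = self._gate("o", z, sigmoid)
        g = self._gate("g", z, math.tanh)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def predict_proba(self, sequence):
        """Run the whole frame sequence and return softmax class probabilities."""
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        for frame in sequence:
            h, c = self.step(frame, h, c)
        logits = matvec(self.W_out, h)
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Placeholder dimensions (assumptions, not values from the article):
N_JOINTS = 10    # hypothetical number of upper-body joints
N_CLASSES = 5    # hypothetical number of worker actions
N_FRAMES = 30    # hypothetical sequence length

rng = random.Random(42)
# Synthetic skeleton sequence: each frame holds (x, y, z) for every joint.
sequence = [[rng.uniform(-1.0, 1.0) for _ in range(N_JOINTS * 3)] for _ in range(N_FRAMES)]

model = TinyLSTMClassifier(n_features=N_JOINTS * 3, n_hidden=16, n_classes=N_CLASSES)
probs = model.predict_proba(sequence)
predicted_action = probs.index(max(probs))
```

Because the weights here are random, `predicted_action` carries no meaning; in practice the weights would be fitted on labeled skeleton sequences with a deep learning framework, and the sketch only shows how a sequence of joint positions is reduced to a single action label.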