During social interactions, humans are capable of initiating and responding to rich and complex social actions despite incomplete world knowledge and physical, perceptual and computational constraints. This capability relies on action perception mechanisms, which exploit regularities in observed goal-oriented behaviours to generate robust predictions and reduce the workload of sensing systems. We argue that three factors are fundamental to achieving this capability. Firstly, human knowledge is frequently hierarchically structured, in both the perceptual and execution domains. Secondly, human perception is an active process driven by current task requirements and context; this is particularly important when the perceptual input is complex (e.g. human motion) and the agent must operate under embodiment constraints. Thirdly, learning is at the heart of action perception mechanisms, underlying the agent's ability to add new behaviours to its repertoire. Based on these factors, we review multiple instantiations of a hierarchically organised, biologically inspired framework for embodied action perception, demonstrating its flexibility in addressing the rich computational contexts of action perception and learning in robotic platforms.