Semantic action recognition aims to classify actions based on their associated semantics and has applications in video captioning and human‐machine interaction. In this paper, the problem is addressed by jointly learning multiple pose lexicons, one per body part. Specifically, multiple visual pose models are learnt, each associated with one body part and characterising the likelihood of an observed video frame being generated from hidden visual poses. Multiple pose lexicon models, one per body part, are learnt simultaneously with the visual pose models; each establishes a probabilistic mapping between the hidden visual poses and the semantic poses parsed from textual instructions. To capture the temporal relations among body parts, a transition model is also learnt, which measures the probability of the alignment transitioning from one position to another. The body part‐based pose lexicon learning provides a novel method of cross‐modality semantic correlation that can be applied to other spatial and temporal data. Action classification is finally formulated as finding the maximum posterior probability that given multiple sequences of visual frames follow multiple sequences of semantic poses, maximised over the most likely visual pose sequences and alignment sequences. Experiments were conducted on five action datasets to validate the effectiveness of the proposed method.
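To make the classification criterion concrete, a minimal sketch of the maximum a posteriori formulation described above is given below. The notation is an assumption introduced for illustration only: $X_p$ denotes the observed frame sequence of body part $p$, $V_p$ its hidden visual pose sequence, $A_p$ its alignment sequence, $S^{c}_{p}$ the semantic pose sequence parsed from the textual instruction of action class $c$, and $P$ the number of body parts:

\[
\hat{c} = \arg\max_{c} \prod_{p=1}^{P} \max_{V_p,\, A_p} p\!\left(X_p, V_p, A_p \mid S^{c}_{p}\right),
\]

where each inner maximisation can be carried out by a Viterbi‐style dynamic programme that combines the visual pose model (frame emission likelihoods), the pose lexicon model (visual‐to‐semantic pose mapping), and the transition model (alignment transition probabilities).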