SUMMARY This paper proposes a method for recognizing natural facial and head behavior using hybrid dynamical systems. Most existing facial and head behavior recognition methods focus on deliberately displayed, prototypical emotion patterns rather than the complex and spontaneous facial and head behaviors that occur in natural conversation. We first capture spatio-temporal features on important facial parts via dense feature extraction. Next, we cluster these features using hybrid dynamical systems and construct a dictionary of motion primitives that covers the elemental motion dynamics underlying facial and head behaviors. With this dictionary, a facial or head behavior can be represented as a distribution of motion primitives. This representation is robust to the varying rhythms of dynamic patterns in complex and spontaneous facial and head behaviors. We evaluate the proposed approach in natural tele-communication scenarios and achieve promising results. Furthermore, the proposed method performs favorably against state-of-the-art methods on three benchmark databases.
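
As a rough illustration of how a behavior might be represented as a distribution of motion primitives, the sketch below builds a normalized histogram from a sequence of per-frame primitive labels. It is a minimal example under stated assumptions, not the paper's implementation: the function name `primitive_distribution`, the integer label encoding, and the choice to count each contiguous run of a primitive once (so that its duration does not matter) are all hypothetical choices made here for illustration.

```python
from collections import Counter
from itertools import groupby


def primitive_distribution(frame_labels, num_primitives):
    """Return a normalized histogram of motion-primitive occurrences.

    frame_labels: per-frame primitive indices in [0, num_primitives)
                  (hypothetical encoding, assumed for this sketch).
    """
    # Collapse each run of identical labels into a single occurrence, so a
    # primitive held for 5 frames counts the same as one held for 50.
    occurrences = [label for label, _ in groupby(frame_labels)]
    total = len(occurrences)
    if total == 0:
        return [0.0] * num_primitives
    counts = Counter(occurrences)
    return [counts.get(k, 0) / total for k in range(num_primitives)]


# The same primitive order performed at two different tempos.
slow = [0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 0, 0]
fast = [0, 0, 2, 2, 1, 0]

print(primitive_distribution(slow, 3))
print(primitive_distribution(fast, 3))
```

Both sequences yield the same histogram because only the occurrences of each primitive are counted, not how long each one lasts. This is one simple way a distribution-based representation can be insensitive to rhythm.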