A challenging problem in human action understanding is to jointly segment and recognize human actions in an unseen video sequence in which one person performs a sequence of continuous actions. In this paper, we propose a discriminative semi-Markov model approach and define a set of features over boundary frames, segments, and neighboring segments. This enables us to conveniently capture a combination of local and global features that best represent a specific action type. To efficiently solve the inference problem of simultaneous segmentation and recognition, we devise a Viterbi-like dynamic programming algorithm, which is able to process 20 frames per second in practice. Moreover, the model is learned discriminatively using the large-margin principle, and learning is formulated as an optimization problem with exponentially many constraints. To solve it efficiently, we present two different optimization algorithms, namely the cutting plane method and the bundle method, and demonstrate that either can be deployed in a "plug and play" fashion. On the theoretical side, we also analyze the generalization error of the proposed approach and provide a PAC-Bayes bound. The proposed approach is evaluated on a variety of datasets and is shown to perform competitively with state-of-the-art methods. For example, on the KTH dataset, it achieves 95% ± 0.01 recognition accuracy, where the best known result on this dataset is 92% ± 0.03 [8].

A preliminary version was published in [28].

Qingfeng Shi, NICTA and ANU, Canberra, Australia. E-mail: qinfeng.shi@rsise.anu.edu.au
Li Wang, Southeast University, Nanjing, China. E-mail: wang.li.seu.nj@gmail.com
Li Cheng, TTI-Chicago, USA. E-mail: licheng@tti-c.org
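
To make the idea of Viterbi-like inference over segments concrete, the following is a minimal sketch of generic semi-Markov decoding, not the exact algorithm or feature set of this paper. The function name `semi_markov_decode`, the callbacks `seg_score` and `trans_score`, the maximum segment length `max_len`, and the toy scores are all hypothetical and introduced only for illustration. The recursion keeps, for every frame index `t` and label `y`, the best score of any segmentation of frames `[0, t)` whose last segment carries label `y`.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int, int]  # (start, end_exclusive, label)


def semi_markov_decode(
    n_frames: int,
    labels: Sequence[int],
    max_len: int,
    seg_score: Callable[[int, int, int], float],
    trans_score: Callable[[int, int], float],
) -> Tuple[float, List[Segment]]:
    """Jointly pick segment boundaries and labels that maximize the total score."""
    neg_inf = float("-inf")
    # best[t][y]: best score for frames [0, t) with the last segment labeled y.
    best = [{y: neg_inf for y in labels} for _ in range(n_frames + 1)]
    # back[t][y]: (start of last segment, label of the previous segment or None).
    back = [{y: None for y in labels} for _ in range(n_frames + 1)]

    for t in range(1, n_frames + 1):
        for y in labels:
            for d in range(1, min(max_len, t) + 1):
                s = t - d
                local = seg_score(s, t, y)
                if s == 0:
                    # First segment of the sequence: no transition term.
                    if local > best[t][y]:
                        best[t][y], back[t][y] = local, (0, None)
                    continue
                for y_prev in labels:
                    cand = best[s][y_prev] + trans_score(y_prev, y) + local
                    if cand > best[t][y]:
                        best[t][y], back[t][y] = cand, (s, y_prev)

    # Backtrack from the best final label.
    y = max(labels, key=lambda lab: best[n_frames][lab])
    total, t, segments = best[n_frames][y], n_frames, []
    while t > 0:
        s, y_prev = back[t][y]
        segments.append((s, t, y))
        t, y = s, y_prev
    segments.reverse()
    return total, segments


# Toy example (hypothetical scores): +1 per frame matching the label, -1 otherwise,
# and a small constant penalty for every segment boundary.
frame_truth = [0, 0, 0, 1, 1, 2, 2, 2]


def toy_seg_score(s: int, t: int, y: int) -> float:
    return float(sum(1 if frame_truth[i] == y else -1 for i in range(s, t)))


def toy_trans_score(y_prev: int, y: int) -> float:
    return -0.1


score, segments = semi_markov_decode(
    len(frame_truth), [0, 1, 2], len(frame_truth), toy_seg_score, toy_trans_score
)
```

On this toy input the decoder recovers the three ground-truth segments. With constant-time segment scores, the sketch runs in time proportional to the number of frames times `max_len` times the square of the number of labels, which is why bounding the segment length matters for real-time use.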