Event detection in crowded surveillance videos is a challenging yet important problem. This paper focuses on pair-wise events, i.e., events that involve the interaction of two persons (e.g., people embracing, meeting or splitting up), in crowded videos. Detecting such an event accurately requires a representation model that characterizes the sequential properties of the two persons' interaction. To this end, we propose a novel pair-wise event detection approach based on cubic features and sequence discriminant learning. A video sequence is first partitioned into several spatio-temporal cubes. Multiple features (e.g., trajectory statistics and bags of spatio-temporal interest points) are extracted from these cubes and then fused into a cubic feature descriptor under a multiple kernel learning (MKL) framework. An SVM with a dynamic time alignment kernel is then used to infer whether an event occurs in the video sequence. Experimental results show that the proposed approach achieves encouraging performance on the TRECVid SED dataset.
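To make the sequence-comparison step concrete, the following is a minimal sketch of a dynamic time alignment kernel between two sequences of cube descriptors. It assumes the commonly used recursion in which a diagonal alignment step counts twice and the score is normalized by the sum of the two sequence lengths, with an RBF base kernel; the abstract does not state which variant or base kernel is used, so the function names `rbf` and `dtak` and the parameter `gamma` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def rbf(x, y, gamma=1.0):
    """Base kernel between two cube descriptors (assumed RBF)."""
    d = x - y
    return float(np.exp(-gamma * np.dot(d, d)))


def dtak(X, Y, gamma=1.0):
    """Dynamic time alignment kernel between sequences X (n x d) and Y (m x d).

    Hypothetical helper: finds the alignment path that maximizes the
    accumulated base-kernel similarity, then normalizes by n + m.
    """
    n, m = len(X), len(Y)
    # Pairwise base-kernel values between all cubes of the two sequences.
    k = np.array([[rbf(x, y, gamma) for y in Y] for x in X])

    # G[i, j] holds the best accumulated similarity aligning X[:i] with Y[:j].
    G = np.full((n + 1, m + 1), -np.inf)
    G[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            kij = k[i - 1, j - 1]
            G[i, j] = max(
                G[i - 1, j] + kij,            # advance in X only
                G[i - 1, j - 1] + 2.0 * kij,  # advance in both (weight 2)
                G[i, j - 1] + kij,            # advance in Y only
            )
    return G[n, m] / (n + m)
```

Because the two sequences may have different lengths, a kernel of this kind can compare video sequences that contain different numbers of cubes. One plausible way to use it is to compute the kernel value for every pair of training sequences and pass the resulting matrix to an SVM that accepts a precomputed kernel.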