Temporal action segmentation is a task of predicting frame-level classes of untrimmed longtermed videos. It can be widely used in various applications including customer analysis, data collection, video indexing, and surveillance. Learning good representations from videos such as motion and spatial representation is a critical factor for model performance. Furthermore, the used model should make reliable decisions with low latency for practical use in a resource-constrained environment. However, many works in action segmentation consider just an accuracy using the pre-calculated representations from heavy 3-dimensional(3D) networks. In this paper, we propose a two-stream action segmentation pipeline that can learn motion and spatial information efficiently and operate online. While the temporal stream combines frame-grouping and TSM for capturing short-term dynamics and long-term temporal information at the same time, the spatial stream captures information on color and appearance complementary to representations from the temporal stream. In addition, the results of both the streams are combined by a cross-attention module to provide a desired classification result for the task. Since it can be operated without heavy 3D convolutional neural networks (CNNs), it takes much less memory and computation than conventional 3D-CNN-based methods. Our proposed network using a non-overlapping sliding window achieved segmentation performance on two action segmentation datasets comparable to many recent works that require full temporal resolution and pre-calculated features of 3D CNNs.INDEX TERMS Video action segmentation, two-stream network, frame-grouping, temporal shift module, sliding window approach I. INTRODUCTION