“…The earlier approaches are mostly based on a combination of hand-crafted visual features with probability graphical models [1,31,30,43,6,8,17] or AND-OR grammar models [2,46]. Recently, the wide adoption of deep convolutional neural networks (CNNs) has demonstrated significant performance improvements on group activity recognition [3,24,41,45,12,32,59,23,39]. Ibrahim et al [24] designed a two-stage deep temporal model, which builds a LSTM model to represent action dynamics of individual people and another LSTM model to aggregate personlevel information.…”