The activities we perform in daily life are generally carried out as a succession of atomic actions that follow a logical order, and the actions observed in a video sequence usually follow such an order as well. In this paper, we propose a hybrid approach resulting from the fusion of a deep learning neural network with a Bayesian-based approach. The latter models human-object interactions and transitions between actions. The key idea is to combine both approaches in the final prediction. We validate our strategy on two public datasets: CAD-120 and Watch-n-Patch. We show that our fusion approach yields accuracy gains of +4 percentage points (pp) and +6 pp, respectively, over a baseline approach. Temporal action recognition performance is clearly improved by the fusion, especially when classes are imbalanced.
[...] less, they generally require less data because they also have fewer underlying free parameters to tune. Therefore their interpretability is less dependent on the available learning data (e.g. less subject to over-fitting). These approaches are relevant when only a small number of samples is available for training. For example, our previous Bayesian approach for action recognition, ANBM (for A New Bayesian Model [9]), models both object-object and human-object interactions through about 50 parameters. Let us note that our ANBM approach also takes into account the transitions between different actions in order to ensure temporal consistency throughout the sequence of actions.
Building on the observation of a possible synergy between the two approaches, we propose a hybrid framework with a fusion, at the decision level, of a C3D [3] convolutional network and our probabilistic ANBM [9] approach based on explicit human-object observations. These two approaches take into account the spatio-temporal characteristics of the different classes of actions.
Due to its large number of parameters, the C3D network needs a large amount of annotated data to be effective, since learning is difficult for under-represented classes. The ANBM approach depends on handcrafted models, so under-represented classes can be predicted even with little data. Thus, our contributions are: (1) a first, minor contribution is the addition of a Gated Recurrent Unit (GRU) recurrent layer to the C3D architecture for action recognition, which also models the temporal correlations between actions; (2) the comparison of both approaches (ANBM and C3D-GRU) on two public datasets, CAD-120 and Watch-n-Patch; (3) the implementation and evaluation of a late fusion mechanism for the predictions of these two approaches, and a comparison with the literature. We observe a performance gain from this hybrid approach.
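To make the idea of a decision-level (late) fusion concrete, the following is a minimal sketch and not the fusion rule used in this work, which is not specified in this section. It assumes that each model outputs one score per action class and that the two score vectors are combined by a weighted average. The function names, the mixing coefficient `weight`, and the example scores are all hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def late_fusion(p_network: Sequence[float],
                p_bayes: Sequence[float],
                weight: float = 0.5) -> List[float]:
    """Fuse two per-class score vectors at the decision level.

    Assumption: a weighted average of the normalized scores.
    `weight` is a hypothetical mixing coefficient in [0, 1]:
    1.0 trusts only the network, 0.0 trusts only the Bayesian model.
    """
    if len(p_network) != len(p_bayes):
        raise ValueError("both models must score the same classes")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    mixed = [weight * a + (1.0 - weight) * b
             for a, b in zip(normalize(p_network), normalize(p_bayes))]
    return normalize(mixed)


def predict(p_network: Sequence[float],
            p_bayes: Sequence[float],
            weight: float = 0.5) -> int:
    """Return the index of the class with the highest fused score."""
    fused = late_fusion(p_network, p_bayes, weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up scores for three action classes (not results from the paper).
p_c3d_gru = [0.45, 0.40, 0.15]  # the network favours class 0
p_anbm = [0.10, 0.20, 0.70]     # the Bayesian model favours class 2
fused = late_fusion(p_c3d_gru, p_anbm, weight=0.5)
label = predict(p_c3d_gru, p_anbm, weight=0.5)
```

With these illustrative scores, the fused decision follows the Bayesian model for the class that the network scores low, which is the kind of complementarity the hybrid framework aims to exploit. Other combination rules, such as a product of posteriors, would fit the same interface.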
The article is organized as follows. In section 2 we present the state of the art and the context of our work. Then in section 3 we present our hybrid approach for action detection.
A comparative study of our results is presented in secti...