This paper lies at the intersection of three research areas: human action recognition, egocentric vision, and event-based visual sensors. The main goal is to compare egocentric action recognition performance under two visual sources: conventional images and event-based visual data. In this work, the events, as triggered by asynchronous event sensors or their simulation, are spatio-temporally aggregated into event frames (a grid-like representation). This makes it possible to use exactly the same neural model for both visual sources, thus easing a fair comparison. Specifically, a hybrid neural architecture combining a convolutional neural network and a recurrent network is used. It is empirically found that this general architecture works for both conventional gray-level frames and event frames. This finding is relevant because it reveals that no modification or adaptation is strictly required to deal with event data for egocentric action classification. Interestingly, action recognition is found to perform better with event frames, suggesting that these data provide discriminative information that helps the neural model learn good features.
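To make the pipeline concrete, the sketch below illustrates the two ingredients described above: spatio-temporal aggregation of raw events into grid-like event frames, and a hybrid CNN plus recurrent classifier operating on the resulting frame sequence. This is a minimal illustration, not the authors' implementation; the two-polarity-channel binning scheme, the layer sizes, the choice of a GRU, and all names are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): events -> event frames -> CNN + GRU classifier.
# Binning scheme, layer sizes, and the GRU choice are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn


def events_to_frames(events, height, width, num_bins):
    """Accumulate events (x, y, t, polarity) into num_bins two-channel frames.

    events: float array of shape (N, 4), with timestamps t sorted ascending.
    Returns an array of shape (num_bins, 2, height, width): channel 0 counts
    positive-polarity events, channel 1 negative-polarity events.
    """
    frames = np.zeros((num_bins, 2, height, width), dtype=np.float32)
    t0, t1 = events[0, 2], events[-1, 2]
    bin_idx = np.minimum(
        ((events[:, 2] - t0) / max(t1 - t0, 1e-9) * num_bins).astype(int),
        num_bins - 1,
    )
    pol = (events[:, 3] < 0).astype(int)          # 0 = positive, 1 = negative
    np.add.at(frames, (bin_idx, pol,
                       events[:, 1].astype(int),   # y (row)
                       events[:, 0].astype(int)),  # x (column)
              1.0)
    return frames


class CNNRecurrentClassifier(nn.Module):
    """Per-frame CNN features fed to a GRU; last hidden state -> action logits."""

    def __init__(self, in_channels=2, hidden=128, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                      # x: (batch, time, channels, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(feats)
        return self.head(h[-1])


# Usage: 5000 synthetic events on a 128x128 sensor, split into 16 event frames.
ev = np.random.rand(5000, 4)
ev[:, 0] *= 127; ev[:, 1] *= 127                 # x, y coordinates
ev[:, 2].sort()                                  # timestamps ascending
ev[:, 3] = np.sign(ev[:, 3] - 0.5)               # polarity in {-1, +1}
frames = torch.from_numpy(events_to_frames(ev, 128, 128, 16)).unsqueeze(0)
logits = CNNRecurrentClassifier()(frames)        # shape: (1, num_classes)
```

Because the event frames share the grid structure of conventional frames, the same CNN plus recurrent model could, in principle, be applied unchanged to gray-level frame sequences (with a single input channel), which is the property that enables the fair comparison discussed above.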