Abstract—This paper studies the application of modern deep convolutional and recurrent neural networks to video classification, specifically human action recognition. A multistream architecture is proposed that uses representation learning to extract embeddings of multimodal features. It is based on 2D convolutional and recurrent neural networks, and the fusion model receives a video embedding as input; classification is thus performed on a compact representation of spatial, temporal and audio information. The proposed architecture achieves 93.1% accuracy on UCF101, outperforming models with similar architectures, and also produces representations that other models can use as features; anomaly detection with autoencoders is proposed as an example of this.
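As a minimal sketch of the late-fusion idea described above (the embedding dimensions, variable names, and random weights below are illustrative assumptions, not taken from the paper), per-stream embeddings can be concatenated into one compact video embedding and fed to a fusion classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-clip stream embeddings (sizes are assumptions):
# a spatial embedding from a 2D CNN, a temporal embedding from an RNN
# over frame features, and an audio embedding.
spatial = rng.standard_normal(512)
temporal = rng.standard_normal(256)
audio = rng.standard_normal(128)

# Fusion: concatenate the streams into a single compact video embedding.
video_embedding = np.concatenate([spatial, temporal, audio])  # shape (896,)

# A linear classifier over the fused embedding stands in for the trained
# fusion model; weights here are random for illustration only.
num_classes = 101  # UCF101 has 101 action classes
W = rng.standard_normal((num_classes, video_embedding.size)) * 0.01
logits = W @ video_embedding

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
predicted_class = int(np.argmax(probs))
print(video_embedding.shape, probs.shape, predicted_class)
```

The same `video_embedding` vector is what downstream models would consume as a feature, e.g. an autoencoder trained on embeddings of normal videos for anomaly detection.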