In recent years, the machine learning community has devoted increasing attention to self-supervised learning. The performance gap between supervised and self-supervised methods has narrowed considerably in many computer vision applications. In this paper, we propose a new self-supervised approach for learning audio-visual representations from large databases of unlabeled videos. Our approach learns its representations through a combination of two tasks, one unimodal and one cross-modal: it performs future prediction on the visual stream, and it learns to align its visual representations with the corresponding audio representations. To implement these tasks, we assess three methodologies: contrastive learning, prototypical contrasting, and redundancy reduction. The proposed approach is evaluated on a new publicly available dataset of videos captured from video game gameplay footage, called Videogame DB. Our method significantly outperforms baselines on most downstream tasks, demonstrating the real benefits of self-supervised learning in a real-world application.
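To make the cross-modal alignment task concrete, the sketch below shows one common way such an objective is implemented: a symmetric InfoNCE-style contrastive loss between batch-paired video and audio embeddings. This is an illustrative assumption, not the paper's exact formulation; the function name, temperature value, and NumPy implementation are all hypothetical.

```python
import numpy as np

def info_nce_loss(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE loss aligning video and audio embeddings.

    Row i of video_emb and row i of audio_emb are treated as a positive
    pair; all other rows in the batch serve as negatives. (Hypothetical
    sketch of a cross-modal contrastive objective.)
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(v))      # positive pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When the two modalities are well aligned, the diagonal similarities dominate and the loss approaches zero; for unrelated embeddings it hovers near log of the batch size, which is why this quantity is a natural training signal for cross-modal alignment.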