Deep learning, a subset of machine learning, allows computers to perform certain tasks, such as image or video recognition, with human-level performance. However, deep models need huge amounts of data to learn from, which requires experts to spend their time on the repetitive and non-scalable task of labelling datasets. Active learning suggests that the cost of annotation can be minimized if a model is allowed to select the best data samples to be labelled. We therefore propose a deep and active learning approach that aims to minimize the labelling effort while maximizing the performance of a model on a given task. We present the task of detecting fish in Remotely Operated Vehicle (ROV) videos as a real-world problem to which our framework can be successfully applied. First, we demonstrate that active learning outperforms random sampling, which is the simplest approach for building a dataset. We then study several active learning settings for the given task, namely different acquisition and aggregation functions. Finally, the proposed methodology is shown to achieve top performance in detecting fish while using only 19% of the available data, thus reducing the cost of building our fish dataset by more than 80%.