In the HEP community, the prediction of data popularity has been studied for many years. Nonetheless, in the face of growing data storage challenges, especially in the upcoming HL-LHC era, better predictive models are still needed to decide whether particular data should be kept, replicated, or deleted.
Caches have proven to be a convenient technique for partially automating storage management, potentially eliminating some of these questions. On the one hand, even simple cache eviction policies such as LRU are beneficial; on the other hand, we show that incorporating knowledge about future access patterns can greatly improve cache performance.
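To make this contrast concrete, the following minimal sketch compares LRU eviction with a clairvoyant (Belady) policy that evicts the file whose next access lies furthest in the future; the access trace and cache capacity are hypothetical, chosen only to illustrate the gap.

```python
from collections import OrderedDict

def lru_hits(trace, capacity):
    """Count cache hits under LRU eviction for a sequence of file accesses."""
    cache = OrderedDict()
    hits = 0
    for f in trace:
        if f in cache:
            hits += 1
            cache.move_to_end(f)           # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used file
            cache[f] = True
    return hits

def belady_hits(trace, capacity):
    """Count hits under Belady's clairvoyant policy: on a miss, evict the
    cached file whose next access lies furthest in the future (or never)."""
    cache = set()
    hits = 0
    for i, f in enumerate(trace):
        if f in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(g):
                for j in range(i + 1, len(trace)):
                    if trace[j] == g:
                        return j
                return float("inf")        # never reused: ideal eviction victim
            cache.remove(max(cache, key=next_use))
        cache.add(f)
    return hits

# Hypothetical cyclic trace: LRU always evicts the file it is about to need.
trace = ["a", "b", "c", "d", "a", "b", "c", "d"]
print(lru_hits(trace, 3), belady_hits(trace, 3))  # prints: 0 3
```

Belady's policy is the theoretical optimum referenced below; it is unattainable in practice because it requires perfect knowledge of future accesses, which is exactly what popularity predictions try to approximate.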
In this paper, we study data popularity at the file level, where the special relation between files belonging to the same dataset can be exploited in addition to standard attributes. We turn to Machine Learning algorithms, in particular Random Forest, which is well suited to Big Data: it can be parallelized, and it is more lightweight and easier to interpret than Deep Neural Networks. Finally, we compare the results with standard cache eviction algorithms and with the theoretical optimum.
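As an illustration of this approach (not the paper's actual pipeline), the sketch below trains a scikit-learn Random Forest on synthetic per-file data; the features, such as a per-file access count and a dataset-level aggregate capturing that files of one dataset tend to be accessed together, are hypothetical stand-ins for the attributes described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_files = 10_000

# Synthetic features (hypothetical): each file's own recent access count,
# an aggregate over the dataset it belongs to, and the file's age.
X = np.column_stack([
    rng.poisson(3, n_files),       # accesses_last_week (per file)
    rng.poisson(50, n_files),      # dataset_accesses (shared dataset context)
    rng.uniform(0, 365, n_files),  # days_since_creation
])
# Synthetic label: will the file be accessed in the next time window?
y = (X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 2, n_files) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs=-1 builds the trees in parallel on all cores, one reason
# Random Forests scale well to large datasets.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
# Feature importances give the interpretability advantage over DNNs.
print("feature importances:", model.feature_importances_)
```

Such per-file predictions can then drive the cache: files predicted to remain popular are retained, approximating the clairvoyant eviction sketched earlier.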