Semantic features extracted by the filters of deep neural networks correlate well with how the human eye perceives distortions. Metrics built on these features (e.g., LPIPS, PieAPP) rely on the relative difference in activation between the feature maps of reference and distorted patch pairs. However, computing deep features as a difference of latent codes between reference and distorted frames is expensive. It is therefore challenging to integrate such metrics into the decision process of modern video codecs like AV1, which run thousands of encoding trials during exhaustive Rate-Distortion Optimization (RDO) searches. In this study, we present a method that uses deep features to predict the distortion perceived locally by human eyes in AV1-encoded videos. The prediction relies on deep features extracted from the reference frame only, which are used to weight the Mean Squared Error (MSE) introduced during encoding. This approach eases integration into video codecs, since feature extraction can run as a pre-processing step before encoding starts. We show that the proposed metric outperforms other Reference-Only metrics on a dataset of local distortions in videos and achieves performance comparable to state-of-the-art Full-Reference video quality metrics.
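To make the reference-only idea concrete, the sketch below shows one possible form of such a weighting: a per-pixel weight map derived from deep features of the reference frame alone, applied to the squared error of the distorted frame. The choice of backbone (VGG16), the channel pooling into a saliency-like map, and the function names are illustrative assumptions, not the weighting proposed in the paper.

```python
# Minimal sketch (assumed design, not the authors' implementation):
# reference-only perceptual weighting of MSE. Activations from a pretrained
# VGG16 are pooled into a per-pixel weight map computed on the reference
# frame only, then used to weight the squared error of the distorted frame.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Assumption: the first convolutional blocks of VGG16 as feature extractor.
features = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()

@torch.no_grad()
def reference_weight_map(reference: torch.Tensor) -> torch.Tensor:
    """Per-pixel weights from deep features of the reference frame only.

    reference: (1, 3, H, W) tensor in [0, 1].
    Returns:   (1, 1, H, W) weight map, normalized to mean 1.
    """
    fmaps = features(reference)                      # (1, C, h, w) activations
    energy = fmaps.abs().mean(dim=1, keepdim=True)   # pool channels into a map
    energy = F.interpolate(energy, size=reference.shape[-2:],
                           mode="bilinear", align_corners=False)
    return energy / (energy.mean() + 1e-8)           # scale so mean weight is 1

def weighted_mse(reference: torch.Tensor, distorted: torch.Tensor) -> torch.Tensor:
    """MSE between frames, weighted by the reference-only map."""
    w = reference_weight_map(reference)
    return (w * (reference - distorted) ** 2).mean()
```

Because the weight map depends only on the reference frame, it can be computed once per frame before encoding begins and reused unchanged across all RDO trials, which is the practical motivation for the reference-only design.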