Presentation attacks are becoming a serious threat to one of the most common biometric applications, namely face recognition (FR). In recent years, numerous methods have been proposed to detect these attacks using publicly available datasets. However, such datasets are often collected in controlled environments and focus on one specific type of attack. We hypothesise that a model's strong performance on one or more public datasets does not necessarily guarantee generalisation to other, unseen face presentation attacks. To test this hypothesis, we present an experimental framework in which the generalisation ability of pre-trained deep models is assessed on four widely used public datasets. Extensive experiments were carried out using various combinations of these datasets. The results show that, in some circumstances, a slight improvement in model performance can be achieved by combining different datasets for training. However, even when trained on a combination of public datasets, models could not generalise to unseen attacks. Moreover, models did not reliably generalise to attack types they had already learned when those attacks appeared in a different dataset. These findings suggest that more diverse datasets are needed to drive this research forward, and that new methods must be devised to extract spoof-specific features that are independent of any particular dataset.
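For illustration, the cross-dataset protocol described above can be expressed as a leave-one-dataset-out loop: a classifier is trained on the union of all but one dataset and evaluated on the held-out one. The sketch below is a minimal illustration of that protocol only; the dataset names, the synthetic feature loader, and the logistic-regression classifier are placeholder assumptions, not the models or datasets used in this work.

```python
# Minimal leave-one-dataset-out sketch, assuming pre-extracted feature
# vectors with binary live/spoof labels. All names here are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def load_features(name, n=200, dim=128):
    """Stand-in loader returning synthetic (features, labels).
    In practice this would load embeddings produced by a
    pre-trained deep model for the named dataset."""
    X = rng.normal(size=(n, dim))
    y = rng.integers(0, 2, size=n)  # 0 = live, 1 = spoof
    return X, y

# Four placeholder datasets standing in for the public benchmarks.
datasets = {name: load_features(name) for name in ["A", "B", "C", "D"]}

for held_out in datasets:
    # Train on the union of the remaining datasets.
    X_train = np.vstack([X for n, (X, _) in datasets.items() if n != held_out])
    y_train = np.concatenate([y for n, (_, y) in datasets.items() if n != held_out])
    X_test, y_test = datasets[held_out]

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"held out {held_out}: AUC = {auc:.3f}")
```

A consistent drop in the held-out AUC relative to within-dataset performance is what would indicate the kind of generalisation gap reported above.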