Fake news has been shown to have a growing negative impact on societies around the world, from influencing elections to spreading misinformation about vaccines. To address this problem, current research has proposed techniques for fake news detection, demonstrating promising results in lab conditions, where models tested on an unseen portion of the same dataset perform well. However, the generalisability of these techniques, and their efficacy in the real world, are evaluated far less often. Studies that have examined generalisability argue that models struggle to distinguish between fake and legitimate news across news topics and time periods different from those on which they were trained. This prompts the more fundamental question of how well fake news detection models generalise across news of the same topic and time period. Through a series of experiments, this study therefore explores how well popular fake news detection models and features (word-level representations and linguistic cues) generalise across similar news. The first experiment reports high accuracies when these techniques are tested on an unseen portion of the same dataset, replicating the findings in the literature. The second experiment, however, reveals that these techniques generalise poorly, suffering accuracy drops of around 50% when tested against different datasets of the same topic and time period. Exploring possible reasons for this poor generalisability, the analysis points to the issue of dataset size, motivating the need for larger, more diverse datasets. It also suggests that word-level representations lead to more biased, less generalisable models. Finally, the findings provide preliminary support for the effectiveness of linguistic and stylistic features, and for the potential of features beyond the word or language level, such as URL redirections and reverse image searches.
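
The contrast between in-dataset and cross-dataset evaluation can be illustrated with a minimal sketch. The snippet below is not the study's actual setup: it assumes a TF-IDF plus logistic regression classifier as a stand-in for a word-level model, and the file names `dataset_a.csv`, `dataset_b.csv` and the columns `text` and `label` are hypothetical placeholders.

```python
# Minimal sketch: in-dataset ("lab conditions") evaluation vs. cross-dataset
# evaluation of a word-level fake news classifier. Dataset files and column
# names are hypothetical, not taken from the study.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

train_df = pd.read_csv("dataset_a.csv")     # hypothetical training dataset
external_df = pd.read_csv("dataset_b.csv")  # hypothetical second dataset, same topic and period

# Hold out an unseen portion of the training dataset for in-dataset testing.
X_train, X_test, y_train, y_test = train_test_split(
    train_df["text"], train_df["label"], test_size=0.2, random_state=42
)

# Word-level representation (TF-IDF) feeding a simple linear classifier.
model = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Accuracy on the unseen split of the same dataset (in-dataset evaluation).
in_dataset_acc = accuracy_score(y_test, model.predict(X_test))

# Accuracy on a different dataset of the same topic and time period
# (cross-dataset evaluation), where generalisability tends to drop.
cross_dataset_acc = accuracy_score(
    external_df["label"], model.predict(external_df["text"])
)

print(f"In-dataset accuracy:    {in_dataset_acc:.3f}")
print(f"Cross-dataset accuracy: {cross_dataset_acc:.3f}")
```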