<p>Feature selection (FS) is an essential step in many machine learning-based predictive maintenance (PdM) applications, covering various industrial processes, components, and monitoring tasks. The selected features not only serve as inputs to the learning models but can also influence further decisions and analyses, e.g., sensor selection and the understandability of the PdM system. Hence, before deploying a PdM system, it is crucial to examine the reproducibility and robustness of the selected features under variations in the input data. This is particularly critical for real-world datasets with a low sample-to-dimension ratio (SDR). However, to the best of our knowledge, the stability of FS methods under data variations has not yet been considered in the field of PdM. This paper addresses this issue with an application to tool condition monitoring in milling, where classifiers based on support vector machines and random forests were employed. We used 5-fold cross-validation to evaluate three popular filter-based FS methods, namely Fisher score, maximum relevance minimum redundancy (mRMR), and ReliefF, in terms of both stability and macro-F1. Further, for each method, we investigated the impact of a homogeneous FS ensemble on both performance indicators. To gain broad insights, we used four milling datasets, two generated from our own experiments and two from NASA's repository, which differ in operating conditions, sensors, SDR, number of classes, etc. Among the findings: 1) different FS methods can yield comparable macro-F1 scores yet considerably different FS stability values; 2) Fisher score (single and/or ensemble) is superior in most cases; 3) mRMR's stability is overall the lowest and the most variable across settings (e.g., sensor(s), subset cardinality), and it benefits the most from the ensemble.</p>
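<p>To make the evaluation protocol concrete, the sketch below illustrates one common way to quantify FS stability under 5-fold cross-validation: the same filter is applied to each training fold, and the average pairwise Jaccard similarity of the resulting top-k subsets is reported. This is a minimal sketch under stated assumptions, not the paper's exact setup: the ANOVA F-score (scikit-learn's f_classif) stands in for the Fisher score, and the synthetic low-SDR data, subset cardinality k, and Jaccard-based stability measure are all illustrative choices.</p>
<pre><code># Illustrative sketch (not the paper's exact protocol): estimating FS
# stability under 5-fold CV as the average pairwise Jaccard similarity
# between the top-k subsets selected on each training fold.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a low sample-to-dimension ratio (SDR) dataset.
X, y = make_classification(n_samples=60, n_features=100, n_informative=10,
                           random_state=0)
k = 10  # assumed subset cardinality

subsets = []
for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True,
                                    random_state=0).split(X, y):
    # ANOVA F-score as a Fisher-like univariate filter on the training fold.
    scores, _ = f_classif(X[train_idx], y[train_idx])
    subsets.append(set(np.argsort(scores)[::-1][:k]))  # top-k feature indices

stability = np.mean([len(a.intersection(b)) / len(a.union(b))
                     for a, b in combinations(subsets, 2)])
print(f"Average pairwise Jaccard stability: {stability:.3f}")
</code></pre>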
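<p>A homogeneous FS ensemble, as investigated for each method, is typically realized by running the same filter on several resamples of the training data and aggregating the per-resample rankings into a single consensus ranking. The sketch below shows one such realization; the bootstrap resampling, mean-rank aggregation, number of bags, and the f_classif stand-in filter are assumptions for illustration, not necessarily the paper's exact design.</p>
<pre><code># Hedged sketch of a homogeneous FS ensemble: the same filter scores several
# bootstrap resamples of the training data, the rankings are aggregated by
# mean rank, and the top-k features of the consensus ranking are returned.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.utils import resample

def ensemble_select(X, y, k=10, n_bags=20, seed=0):
    rng = np.random.RandomState(seed)
    n_feats = X.shape[1]
    rank_sum = np.zeros(n_feats)
    for _ in range(n_bags):
        Xb, yb = resample(X, y, random_state=rng)  # bootstrap resample
        scores, _ = f_classif(Xb, yb)       # ANOVA F as a Fisher-like filter
        order = np.argsort(scores)[::-1]    # best-scoring feature first
        ranks = np.empty(n_feats)
        ranks[order] = np.arange(n_feats)   # rank 0 = best
        rank_sum += ranks
    return set(np.argsort(rank_sum)[:k])    # lowest mean rank wins

X, y = make_classification(n_samples=60, n_features=100, n_informative=10,
                           random_state=0)
print(sorted(ensemble_select(X, y, k=10)))
</code></pre>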