Advances in composite technology have led to the substitution of conventional metallic construction materials by composites. However, more widespread application of composites is currently restricted by complex fracture mechanisms that are not well understood. One approach to overcoming this challenge is structural health monitoring, which provides information on the current system state and the state of health in real time. In this context, the reliability assessment of structural health monitoring systems remains an open issue. The reliability of conventional non-destructive testing systems is evaluated, measured, and partly standardized using widely accepted methods such as the probability of detection. Frequently, the a90|95 value, which is determined from probability of detection curves, is used as a performance measure; it indicates the minimum damage size that is detected with a probability of 90% at a confidence level of 95%. In contrast to non-destructive testing, structural health monitoring involves additional data analysis steps, namely statistical pattern recognition, whose classification results are also subject to uncertainty. Because comparable methods are not available, the reliability of structural health monitoring systems is usually not quantified. To investigate the influences on classification performance, experiments were conducted. In particular, the effect of variable loading conditions and the evolution of damage over time are considered. To this end, acoustic emission measurements were performed while specimens of the composite material were subjected to different cyclic loading patterns. Here, acoustic emission refers to elastic stress waves in the ultrasound regime that emerge from the structure on damage initiation and propagation. Furthermore, a frequency-based damage classification scheme for acoustic emission measurements is proposed.
Time-frequency domain features are extracted from the measurement signals using the short-time Fourier transform. Classification is performed using a support vector machine. Both choices serve as typical examples for discussing effects that apply equally to other approaches. The experimental results presented in this article on fault diagnosis and discrimination of delamination, matrix crack, debonding, and fiber breakage in carbon-fiber-reinforced polymer material show that a support vector machine achieves good performance under 10-fold cross-validation. However, during model deployment, the classification reliability depends strongly on the loading conditions, a dependency that was not visible in the preceding cross-validation. These results indicate that the application of classifier-based structural health monitoring is more complex than generally assumed. The relations between classification approaches, testing conditions, measurement devices, and filters have to be discussed with respect to their ability to provide reliable statements about the actual damage state.
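
The processing chain described above (short-time Fourier transform features, support vector machine, 10-fold cross-validation) can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. The sampling rate, burst model, class center frequencies, feature definition, kernel, and loading-pattern identifiers are all hypothetical placeholders, and the signals are synthetic. The second evaluation, which holds out one loading pattern at a time, is one possible way to probe the dependency on loading conditions and is likewise an assumption.

```python
import numpy as np
from scipy.signal import stft
from sklearn.model_selection import (LeaveOneGroupOut, StratifiedKFold,
                                     cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FS = 1_000_000  # assumed sampling rate in Hz (hypothetical)


def stft_features(signal, fs=FS, nperseg=128):
    """Average STFT magnitude per frequency bin over all time frames."""
    _, _, Z = stft(signal, fs=fs, nperseg=nperseg)
    return np.abs(Z).mean(axis=1)


def synthetic_burst(center_hz, rng, n=1024, fs=FS):
    """Decaying sinusoid plus noise; a stand-in for a measured AE event."""
    t = np.arange(n) / fs
    envelope = np.exp(-2e4 * t)
    noise = 0.05 * rng.standard_normal(n)
    return envelope * np.sin(2 * np.pi * center_hz * t) + noise


def build_dataset(seed=0, per_class=30):
    rng = np.random.default_rng(seed)
    # Placeholder frequencies only; NOT measured values for real damage modes.
    centers = {0: 100e3, 1: 200e3, 2: 300e3, 3: 400e3}
    X, y, groups = [], [], []
    for label, hz in centers.items():
        for i in range(per_class):
            X.append(stft_features(synthetic_burst(hz, rng)))
            y.append(label)
            groups.append(i % 3)  # hypothetical loading-pattern id
    return np.array(X), np.array(y), np.array(groups)


def evaluate(X, y, groups):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # Evaluation 1: 10-fold cross-validation on the pooled data.
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    cv_scores = cross_val_score(model, X, y, cv=kfold)
    # Evaluation 2: train on all loading patterns but one, test on the rest.
    logo_scores = cross_val_score(model, X, y, cv=LeaveOneGroupOut(),
                                  groups=groups)
    return cv_scores, logo_scores


X, y, groups = build_dataset()
cv_scores, logo_scores = evaluate(X, y, groups)
print("10-fold accuracy per fold:", cv_scores)
print("held-out loading pattern accuracy:", logo_scores)
```

On real measurements, a gap between the two sets of scores would suggest that pooled cross-validation overstates the reliability reached under loading conditions not seen during training. The synthetic data here are not constructed to reproduce that effect.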