Driver sleepiness is a contributing factor in many road fatalities. A long-standing goal in driver state research has therefore been to develop a robust sleepiness detection system. It has been suggested that various heart rate variability (HRV) metrics can be used for driver sleepiness classification. However, since heart rate is modulated not only by sleepiness but also by several other time-varying intra-individual factors such as posture, distress, boredom and relaxation, it is important to highlight not only the possibilities but also the difficulties involved in HRV-based driver sleepiness classification. This paper investigates the reliability of HRV as a standalone feature for driver sleepiness detection in a realistic setting. Data from three real-road driving studies were used, including 86 drivers in both alert and sleep-deprived conditions. Subjective ratings based on the Karolinska sleepiness scale (KSS) were used as ground truth when training four binary classifiers (k-nearest neighbours, support vector machine, AdaBoost, and random forest). The best performance was achieved with the random forest classifier, with an accuracy of 85%. However, the accuracy dropped to 64% for three-class classification and to 44% for subject-independent, leave-one-participant-out classification. The worst results were obtained for the severely sleepy class. The results show that in realistic driving conditions, subject-independent sleepiness classification based on HRV performs poorly. The conclusion is that more work is needed to control for the many confounding factors that also influence HRV before it can be used as input to a driver sleepiness detection system.
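To make the distinction between subject-dependent and subject-independent evaluation concrete, the following is a minimal sketch of a leave-one-participant-out split combined with two common time-domain HRV metrics (SDNN and RMSSD). It is an illustration only: the abstract does not state which HRV metrics or segment lengths were used, so the feature choice, the function names, and the toy data below are all assumptions rather than the study's actual pipeline.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def sdnn(rr_ms: Sequence[float]) -> float:
    """Sample standard deviation of RR intervals, in milliseconds."""
    mean = sum(rr_ms) / len(rr_ms)
    return math.sqrt(sum((x - mean) ** 2 for x in rr_ms) / (len(rr_ms) - 1))


def rmssd(rr_ms: Sequence[float]) -> float:
    """Root mean square of successive RR-interval differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def leave_one_participant_out(
    participant_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_id, train_indices, test_indices), one fold per participant.

    All segments from the held-out driver go to the test set, so the
    classifier never sees that driver during training.
    """
    for held_out in sorted(set(participant_ids)):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


# Hypothetical toy data: (driver id, RR intervals in ms) per segment.
segments = [
    ("d1", [800, 810, 790, 805]),
    ("d1", [820, 815, 830, 825]),
    ("d2", [700, 720, 690, 710]),
    ("d3", [900, 905, 895, 910]),
    ("d3", [880, 870, 890, 885]),
]
ids = [driver for driver, _ in segments]
features = [(sdnn(rr), rmssd(rr)) for _, rr in segments]

for held_out, train_idx, test_idx in leave_one_participant_out(ids):
    # A classifier would be fitted on features[train_idx] and scored on
    # features[test_idx]; here only the split itself is shown.
    print(held_out, "train:", train_idx, "test:", test_idx)
```

With a random split of segments, data from the same driver can appear in both the training and the test set, which lets a classifier exploit individual baseline differences in HRV. Holding out every segment from one driver at a time removes that shortcut, which is the setting in which the abstract reports the drop to 44% accuracy.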