This paper introduces a visual method for diver detection in the context of Human-Robot Interaction (HRI). Detection is treated as a classification problem, in which a discriminative model is trained on image features computed from the target (diver) and the underwater scenery. Such scenery poses great challenges due to its high variability: it often presents strong illumination changes, scarce features, and image distortions. For this reason, it is desirable to represent these images with multiple types of complementary features. System scalability, however, decreases as the number of feature types grows, since the amount of data needed to represent queries and indexes also increases. To remedy this, we modify the Nearest Class Mean Forests (NCMF) method, a variant of Random Forests, to integrate as many feature types as desired without compromising scalability or incurring performance decay. The system outperforms common generative tracking methods, which fail to encompass different types of distortion in a single model and ignore background information. In contrast to tracking methods based on acoustic sensors, which output a single value (the distance to the diver), our approach outputs a region encompassing the diver's body, information that can be further exploited to enhance underwater HRI. Moreover, camera setups offer greater flexibility than acoustic sensors with respect to size and energy-consumption constraints. All of the system's aforementioned capabilities are tested on real-life data obtained from field experiments.
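For concreteness, the sketch below illustrates the nearest-class-mean routing that underlies NCMF: at each internal node a sample is sent to the child associated with the closest class centroid. This is a minimal illustration only; the class and method names (NCMSplit, route) are hypothetical, and the paper's actual modification for mixing multiple feature types is not reproduced here.

```python
import numpy as np

# Minimal sketch of the nearest-class-mean split used by NCM Forests.
# Names are illustrative, not the authors' implementation.

class NCMSplit:
    """Routes a sample to the child whose class centroid is nearest."""

    def __init__(self, centroids, children):
        self.centroids = np.asarray(centroids)  # (k, d) class means at this node
        self.children = np.asarray(children)    # (k,) child index per centroid

    def route(self, x):
        # Euclidean distance from the sample to each class mean
        dists = np.linalg.norm(self.centroids - x, axis=1)
        return self.children[np.argmin(dists)]

# Example: two 2-D class means, routing a sample to child 0 or 1
split = NCMSplit([[0.0, 0.0], [5.0, 5.0]], [0, 1])
print(split.route(np.array([1.0, 0.5])))  # -> 0 (closer to the first mean)
```

Because routing depends only on distances to a small set of stored centroids, the per-query cost stays modest even as richer (e.g., concatenated multi-type) feature descriptors are used, which is the scalability property the modified NCMF aims to preserve.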