In this paper, we investigate the potential to implicitly estimate the Quality of Experience (QoE) of video streaming users by acquiring a video of their face and monitoring their facial expression and gaze direction. To this aim, we conducted a crowdsourcing test in which participants were asked to watch 20 videos subject to different impairments and to rate their quality, while their face was recorded with their PC's webcam. Two sets of features were then extracted: the Action Units (AUs) that represent the facial expression, and the positions of the eye pupils. These features were used, together with the QoE values provided by the participants, to train three machine learning classifiers, namely, a Support Vector Machine with quadratic kernel, RUSBoost trees, and bagged trees. We considered two prediction models: one using only the AU features, and one combining the AU features with the pupil positions. The RUSBoost trees achieved the best results in terms of accuracy, sensitivity, and area under the curve. In particular, when all the features were considered, the achieved accuracy was 44.7%, 59.4%, and 75.3% on the 5-level, 3-level, and 2-level quality scales, respectively. While these results are not yet satisfactory, they represent a promising basis for further work.
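The feature-based classification setup described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the AU intensities, pupil coordinates, and quality labels are synthetic stand-ins, and plain boosted and bagged trees from scikit-learn are used in place of the RUSBoost variant (which additionally applies random undersampling to balance the classes):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 17 AU intensities plus 2 pupil
# coordinates per sample, with 5-level quality labels 1..5.
n_samples = 500
X_au = rng.random((n_samples, 17))      # Action Unit features
X_pupil = rng.random((n_samples, 2))    # pupil (x, y) position
X = np.hstack([X_au, X_pupil])          # combined prediction model
y = rng.integers(1, 6, size=n_samples)  # 5-level quality labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Simplified counterparts of the three classifier families compared
# in the paper (boosted trees stand in for RUSBoost trees).
models = {
    "SVM (quadratic kernel)": SVC(kernel="poly", degree=2),
    "boosted trees (RUSBoost stand-in)": AdaBoostClassifier(random_state=0),
    "bagged trees": BaggingClassifier(random_state=0),
}

for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name}: accuracy = {clf.score(X_te, y_te):.3f}")
```

With random labels the accuracies hover around chance level (0.2 for 5 classes); with real AU and gaze features, the per-class imbalance of subjective ratings is what motivates the undersampling step in RUSBoost.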