Humans can interact with several kinds of machines (motor vehicles, robots, among others) in different ways; one of them is through head pose. In this work, we propose a head pose estimation framework that combines 2D and 3D cues using the concept of key frames (KFs). KFs are a set of frames, learned automatically offline, that consist of the following: 2D features, encoded through Speeded Up Robust Features (SURF) descriptors; 3D information, captured by Fast Point Feature Histogram (FPFH) descriptors; and the target's head orientation (pose) in real-world coordinates, represented through a 3D facial model. The KF information is then reinforced through a global optimization process that minimizes error in a manner similar to bundle adjustment. In an online process, the KFs allow us to formulate a hypothesis of the head pose in new images, which is then refined through an optimization process performed by the iterative closest point (ICP) algorithm. This KF-based framework can handle partial occlusions and extreme rotations even with noisy depth data, improving both pose estimation accuracy and detection rate. We evaluate our proposal on two public state-of-the-art benchmarks: (1) the BIWI Kinect Head Pose Database and (2) the ICT 3D HeadPose Database. In addition, we evaluate the framework on a small but challenging dataset of our own, in which the targets perform more complex behaviors than in the aforementioned public datasets. We show that our approach outperforms relevant state-of-the-art proposals on all these datasets.
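The online refinement step mentioned above relies on the iterative closest point (ICP) algorithm. For illustration only, the following is a minimal point-to-point ICP sketch in NumPy; it is not the authors' implementation (which registers a 3D facial model against noisy depth data), just the generic technique: alternate nearest-neighbor correspondence with a closed-form rigid-transform update.

```python
import numpy as np

def best_fit_transform(A, B):
    """Least-squares rigid transform (Kabsch/Umeyama) mapping points A onto B."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    H = (A - cA).T @ (B - cB)                 # cross-covariance of centered clouds
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cB - R @ cA
    return R, t

def icp(source, target, iters=30, tol=1e-10):
    """Point-to-point ICP: returns (R, t) such that R @ source + t ≈ target."""
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        # Nearest-neighbor correspondences (brute force, fine for small clouds).
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        nn = d.argmin(axis=1)
        err = d[np.arange(len(src)), nn].mean()
        R, t = best_fit_transform(src, target[nn])
        src = src @ R.T + t                   # apply the incremental update
        R_total = R @ R_total                 # accumulate the total transform
        t_total = R @ t_total + t
        if abs(prev_err - err) < tol:         # stop when the error plateaus
            break
        prev_err = err
    return R_total, t_total
```

In the framework described in the abstract, the pose hypothesis taken from the best-matching key frame would play the role of the initial alignment, so ICP only needs to correct a small residual rotation and translation, which is exactly the regime where nearest-neighbor correspondences are reliable.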