Due to population ageing, the cost of health care will rise in the coming years. One way to help people, and especially elderly people, is to introduce domestic robots that assist in daily life, so that people become less dependent on home care. Joint visual attention models can be used for natural human-robot interaction. Joint visual attention means that two humans, or a robot and a human, share attention on the same object. This can be established by pointing, by eye gaze or by speech. The goal of this thesis is to develop a non-verbal joint visual attention model for object detection that integrates gestures, gaze, saliency and depth. The question answered in this report is: how can the information from gestures, gaze, saliency and depth be integrated in the most efficient way to determine the object of interest?

Existing joint visual attention models only work when the human is in front of the robot, so that the human is in view of the camera. Our model should be more flexible than existing models, so it needs to work in different configurations of human, robot and object. Furthermore, the joint visual attention model should be able to determine the object of interest when the pointing direction or the gaze location is not available.

The saliency algorithm of Itti et al. [1] has been used to create a bottom-up saliency map. The second bottom-up cue, depth, is determined by segmenting the environment to extract the objects. Apart from the bottom-up cues, top-down cues can be used as well. The pointing finger is identified, and the pointing direction is retrieved from the eigenvalues and eigenvectors of the finger. A pointing map is created from the angle between the 3D pointing direction vector and the 3D vector from the pointing finger to the object.
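The pointing cue described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis. It assumes that the finger is available as a set of 3D points, that the pointing direction is the principal axis of those points (the eigenvector with the largest eigenvalue of their covariance matrix), and that the angle is turned into a score by a Gaussian falloff. The function names, the fingertip-based sign convention and the width `sigma` are hypothetical choices made only for illustration.

```python
import numpy as np


def pointing_direction(finger_points, fingertip):
    """Estimate a unit pointing direction from an (N, 3) array of finger points.

    Assumption: the direction is the principal axis of the points.
    """
    centroid = finger_points.mean(axis=0)
    cov = np.cov(finger_points - centroid, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    axis = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
    # An eigenvector has no preferred sign; orient it towards the fingertip
    # (hypothetical convention, not taken from the thesis).
    if np.dot(axis, fingertip - centroid) < 0:
        axis = -axis
    return axis


def pointing_angle(direction, fingertip, target):
    """Angle in radians between the pointing direction and the fingertip-to-target vector."""
    to_target = target - fingertip
    cos_angle = np.dot(direction, to_target) / (
        np.linalg.norm(direction) * np.linalg.norm(to_target)
    )
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


def pointing_score(angle, sigma=np.radians(15.0)):
    """Map an angle to a score in (0, 1]; the Gaussian form and sigma are assumptions."""
    return float(np.exp(-0.5 * (angle / sigma) ** 2))


if __name__ == "__main__":
    finger = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                       [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
    tip = finger[-1]
    direction = pointing_direction(finger, tip)
    for target in (np.array([5.0, 0.0, 0.0]), np.array([3.0, 2.0, 0.0])):
        angle = pointing_angle(direction, tip, target)
        print(target, np.degrees(angle), pointing_score(angle))
```

Evaluating the score for every object (or every pixel with a known 3D position) would give a map in which locations close to the pointing ray receive high values and locations far from it receive values near zero.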
A hybrid model has been developed to compute the gaze map; depending on how textured the object is, it switches between a texture-based approach and a color-based approach.

Depending on the configuration of the human, robot and object, three or four maps are available to determine the object of interest. In some configurations, the pointing map or the gaze map is not available. In that case, the combined saliency map is obtained by point-wise multiplication of the three available maps. If all four maps are at our disposal, all maps are added and the result is multiplied by the pointing mask.

When the human and robot stand opposite each other and the pointing, bottom-up saliency and depth maps are combined, 93.3% of the objects are detected correctly. If the human is standing next to the robot and the gaze, bottom-up saliency and depth maps are combined, the detection rate is 67.8%. If robot, human and object are arranged in a triangle, the detection rate is 96.3%.

The main contribution is that the joint visual attention model is able to detect objects of interest in different configurations of human, robot and object, and that it also works when one of the four cues is not available. Furthermore, a hybrid model has ...