Building robots capable of interacting effectively and autonomously with their environments requires providing them with the ability to model the world. That is to say, the robot must interpret the environment not as a set of points, but as an organization of more complex structures with human-like meaning. Among the variety of sensory inputs with which a robot could be equipped, vision is one of the most informative. Through vision, the robot can analyze the appearance of objects. Stereo vision additionally makes it possible to extract spatial information about the environment, allowing the robot to determine the structure of the different elements composing it. However, vision suffers from some limitations when considered in isolation. On the one hand, cameras have a limited field of view that can only be compensated for through camera movements. On the other hand, the world is formed by non-convex structures that can only be interpreted by actively exploring the environment. Hence, the robot must move its head and body to give meaning to the perceived elements composing its environment.

The combination of stereo vision and active exploration provides a means to model the world. While the robot explores the environment, perceived regions can be clustered into more complex structures such as walls and objects on the floor. Nevertheless, even in simple scenarios with few rooms and obstacles, the robot must be endowed with different abilities to successfully solve the task. For instance, during exploration, the robot must be able to decide where to look while selecting where to go, avoiding obstacles, and recognizing what it is looking at. From the point of view of perception, different visual behaviors take part in this process, such as those that direct the gaze towards what the robot can recognize and model, or those dedicated to keeping the robot within safety limits. From the action perspective, the robot has to move in different ways depending on internal states (i.e. the status of the modeling process) and external situations (i.e. obstacles on the way to a target position). Perception and action should influence each other in such a way that deciding where to look depends on what the robot is doing, but also in a way that what is being perceived determines what the robot can or cannot do.

Our solution to all these questions relies heavily on visual attention. Specifically, the foundation of our proposal is that attention can organize the perceptual and action processes by acting as an intermediary between them. The attentional connection makes it possible, on the one hand, to drive the perceptual process according to the behavioral requirements and, on the other hand, to modulate actions on the basis of the perceptual results of attentional control. Thus, attention solves the where-to-look problem and, additionally, attention prevents