Abstract-This paper describes a complete vision-based framework that enables a humanoid robot to perform simple manipulation tasks in a domestic environment. Our system emphasizes autonomous operation in an unstructured environment with minimal a priori knowledge, and is robust to visual distractions and calibration errors. For each new task, the robot first acquires a dense 3D image of the scene using our novel stereoscopic light stripe scanner, which rejects secondary reflections and cross-talk. A data-driven analysis of the range map then identifies and models simple objects as geometric primitives. Objects are reliably tracked through clutter and occlusions by exploiting multimodal cues (colour, texture and edges). Finally, manipulations are performed by controlling the end-effector using a hybrid position-based visual servoing scheme that fuses visual and kinematic measurements and compensates for calibration errors. Two domestic tasks are implemented to evaluate the performance of the framework: identifying and grasping a yellow box without any prior knowledge of the object, and pouring rice from an interactively selected cup into a bowl.
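To make the flow of the framework concrete, the sketch below maps its four stages (scan, segment, track, servo) onto placeholder Python. It is an illustration of the pipeline only, under assumed stub interfaces; every function and class name here is hypothetical and none of it is code from the paper.

```python
# Hypothetical sketch of the four-stage pipeline summarised in the abstract;
# all names are illustrative placeholders, not the paper's implementation.

from dataclasses import dataclass, field


@dataclass
class Primitive:
    """Scene object modelled as a geometric primitive."""
    shape: str                                       # e.g. "box", "cylinder"
    pose: list = field(default_factory=lambda: [0.0] * 6)


def scan_scene():
    """Stage 1: dense range map from the stereoscopic light stripe scanner.

    Validating each stripe measurement in both cameras is what lets the
    scanner reject secondary reflections and cross-talk (placeholder here).
    """
    return [[0.0]]  # placeholder depth map


def fit_primitives(range_map):
    """Stage 2: data-driven segmentation of the range map into primitives."""
    return [Primitive("box")]  # placeholder result


def track(target, frame):
    """Stage 3: fuse colour, texture and edge cues to re-locate the target
    despite clutter and occlusions (placeholder: pose assumed unchanged)."""
    return target.pose


def servo_step(target_pose, ee_visual, ee_kinematic):
    """Stage 4: hybrid position-based visual servoing. Fusing the visually
    observed and kinematically predicted end-effector poses compensates for
    calibration errors; here we just report the remaining pose error."""
    return [t - v for t, v in zip(target_pose, ee_visual)]


if __name__ == "__main__":
    objects = fit_primitives(scan_scene())
    target = objects[0]                      # e.g. interactively selected
    pose = track(target, frame=None)
    error = servo_step(pose, ee_visual=[0.0] * 6, ee_kinematic=[0.0] * 6)
    print("residual end-effector pose error:", error)
```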