This paper proposes an object localization and depth estimation method that lets robots select and set goals via machine vision. An algorithm based on a deep region-based convolutional neural network (R-CNN) recognizes targets and non-targets. After the targets are recognized, we employ both k-nearest neighbors (kNN) and a fuzzy inference system (FIS) to localize their two-dimensional (2D) positions. Moreover, based on the field of view (FoV) and a disparity map, the depth is estimated with a monocular camera mounted on the end-effector in an eye-in-hand manipulator structure. Although only a single monocular camera is used, the system obtains the camera baseline simply by shifting the end-effector a few millimeters along the x-axis. We can thus obtain and identify the depth of the layered environment as 3D points, which form a dataset used to recognize the junction box covers on the table. Experimental tests confirmed that the algorithm could accurately distinguish junction box covers from non-targets and could estimate whether the targets are within the depth range for grasping by three-finger grippers. Furthermore, the proposed method achieved an optimized depth error of -0.0005%, and the localization method could precisely position the junction box cover, with recognition and picking rates of 0.993 and 98.529%, respectively.
INDEX TERMS Region-based convolutional neural network, eye-in-hand manipulator, machine vision, robotics and automation.
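
The depth-estimation idea summarized above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera, a pure translation of the camera along the x-axis between the two images, and the standard stereo relation Z = f·B/d; the function names, the image width, the FoV value, and the grasping range below are hypothetical and are not taken from the paper.

```python
import math


def focal_length_px(image_width_px: float, hfov_deg: float) -> float:
    """Focal length in pixels derived from the horizontal FoV (pinhole model)."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def depth_from_disparity(disparity_px: float, baseline_mm: float,
                         image_width_px: float, hfov_deg: float) -> float:
    """Depth in mm from the standard stereo relation Z = f * B / d.

    baseline_mm is the distance the end-effector was shifted along x
    between the two captures (assumed to be a pure translation).
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    f = focal_length_px(image_width_px, hfov_deg)
    return f * baseline_mm / disparity_px


def within_grasp_range(depth_mm: float, min_mm: float, max_mm: float) -> bool:
    """Hypothetical check that a target lies in the gripper's reachable depth."""
    return min_mm <= depth_mm <= max_mm


if __name__ == "__main__":
    # Illustrative values only: 640 px wide image, 90 degree FoV,
    # 5 mm shift of the end-effector, 8 px measured disparity.
    z = depth_from_disparity(8.0, 5.0, 640.0, 90.0)
    print(f"estimated depth: {z:.1f} mm")          # approximately 200 mm
    print("graspable:", within_grasp_range(z, 150.0, 250.0))
```

The sketch shows why a shift of only a few millimeters suffices in principle: the baseline enters the relation linearly, so a small, accurately known shift still yields a usable depth as long as the disparity can be measured reliably.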