End-to-end robot learning has achieved great success in enabling robots to acquire various manipulation skills. It learns a function that maps visual information directly to robot actions. Because of the diversity of target objects, however, most end-to-end approaches have focused on a single object-specific task and generalize poorly. In this work, an object detection-based one-shot learning method is proposed that separates semantic understanding from robot control. It enables a robot to acquire similar manipulation skills efficiently and to cope with new objects from a single demonstration. The approach consists of two modules: an object detection network and a motion policy network. From RGB images, the object detection network outputs the task-related semantic keypoint of the target object (the center of the container in this application), and the motion policy network generates the motion action from the depth map and the detected keypoint. To evaluate the proposed pipeline, a series of experiments is conducted on typical placing tasks in different simulation scenarios, and the learned policy is additionally transferred from simulation to the real world without any fine-tuning.
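
A minimal sketch of the two-module pipeline is given below, assuming a convolutional keypoint detector and an MLP-based motion policy in PyTorch; the layer sizes, the 6-DoF action dimension, and all module and variable names are illustrative assumptions rather than the exact architecture used in this work.

```python
import torch
import torch.nn as nn


class KeypointDetector(nn.Module):
    """Predicts a task-related 2D keypoint (e.g. container center) from an RGB image."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)  # normalized (u, v) pixel coordinates

    def forward(self, rgb):  # rgb: (B, 3, H, W)
        return self.head(self.backbone(rgb))


class MotionPolicy(nn.Module):
    """Maps a depth map and the detected keypoint to a robot action."""

    def __init__(self, action_dim=6):  # assumed 6-DoF end-effector action
        super().__init__()
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(
            nn.Linear(32 + 2, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, depth, keypoint):  # depth: (B, 1, H, W), keypoint: (B, 2)
        feat = self.depth_encoder(depth)
        return self.mlp(torch.cat([feat, keypoint], dim=-1))


if __name__ == "__main__":
    detector, policy = KeypointDetector(), MotionPolicy()
    rgb = torch.rand(1, 3, 128, 128)
    depth = torch.rand(1, 1, 128, 128)
    keypoint = detector(rgb)          # semantic understanding from RGB
    action = policy(depth, keypoint)  # robot control from depth + keypoint
    print(action.shape)               # torch.Size([1, 6])
```

Because the two modules interact only through the detected keypoint, the detector can in principle be retrained or swapped for a new object category without touching the motion policy, which is the separation of semantic understanding from control described above.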