In this work, an approach to robot skill learning from voice commands and hand movement sequences is proposed. The hand motion is recorded with a 3D camera. The proposed framework consists of three elements. First, a hand detector is applied to each frame to extract key points, represented by 21 hand landmarks; the trajectory of the index fingertip is then taken as the hand motion for further processing. Second, the trajectory is divided into five segments using voice commands and fingertip velocity. These five segments, reach, grasp, move, position, and release, are treated as skills in this work. Only two voice commands, grasp and release, are required, since these two skills are of short duration and can be viewed as discrete events. Finally, dynamic movement primitives (DMPs) are learned to represent reach, move, and position. To demonstrate the approach, a human demonstration of a pick-and-place task is recorded and evaluated.
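As a concrete illustration of the learned representation, the sketch below fits a minimal one-dimensional discrete dynamic movement primitive to a demonstrated trajectory and reproduces it. The gains, basis-function layout, and the synthetic minimum-jerk demonstration standing in for a recorded fingertip segment (e.g. reach) are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def min_jerk(t):
    # Minimum-jerk profile from 0 to 1 over t in [0, 1]; a stand-in for a
    # recorded fingertip trajectory segment such as "reach" (assumption).
    return 10 * t**3 - 15 * t**4 + 6 * t**5

def learn_dmp(y, dt, n_bfs=30, alpha=25.0, beta=6.25, alpha_x=8.0):
    """Fit forcing-term weights so the DMP reproduces trajectory y."""
    y0, g = y[0], y[-1]
    yd = np.gradient(y, dt)
    ydd = np.gradient(yd, dt)
    t = np.arange(len(y)) * dt
    x = np.exp(-alpha_x * t)                          # canonical phase
    # Forcing term that would make the spring-damper track the demo.
    f_target = ydd - alpha * (beta * (g - y) - yd)
    c = np.exp(-alpha_x * np.linspace(0, t[-1], n_bfs))  # basis centers
    h = n_bfs**1.5 / c / alpha_x                      # widths (heuristic)
    psi = np.exp(-h * (x[:, None] - c)**2)            # (T, n_bfs)
    xi = x * (g - y0)                                 # phase-scaled regressor
    # Locally weighted regression: one weight per basis function.
    w = (psi * (xi * f_target)[:, None]).sum(0) \
        / ((psi * (xi**2)[:, None]).sum(0) + 1e-10)
    return dict(w=w, c=c, h=h, y0=y0, g=g,
                alpha=alpha, beta=beta, alpha_x=alpha_x)

def rollout(dmp, dt, n_steps):
    """Integrate the DMP with Euler steps and return the reproduced path."""
    y, yd, x = dmp['y0'], 0.0, 1.0
    out = []
    for _ in range(n_steps):
        psi = np.exp(-dmp['h'] * (x - dmp['c'])**2)
        f = (psi @ dmp['w']) * x * (dmp['g'] - dmp['y0']) / (psi.sum() + 1e-10)
        ydd = dmp['alpha'] * (dmp['beta'] * (dmp['g'] - y) - yd) + f
        yd += ydd * dt
        y += yd * dt
        x += -dmp['alpha_x'] * x * dt
        out.append(y)
    return np.array(out)

dt, T = 0.001, 1000
demo = min_jerk(np.arange(T) * dt)
dmp = learn_dmp(demo, dt)
repro = rollout(dmp, dt, T)
err = np.abs(repro - demo).max()
```

Because the forcing term vanishes as the canonical phase decays, the reproduced trajectory converges to the demonstrated goal position, which is what makes DMPs suitable for encoding the reach, move, and position skills with adaptable start and goal points.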