Real-time MRI can be used to obtain videos that describe articulatory movements during running speech. For detailed analysis based on a large number of video frames, it is necessary to extract the contours of speech organs, such as the tongue, semi-automatically. The present study attempted to extract the contours of speech organs from videos using a machine learning method. First, an expert operator manually extracted the contours from the frames of a video to build training data sets. The learning operators, or learners, then extracted the contours from each frame of the video. Finally, the errors representing the geometrical distance between the extracted contours and the ground truth, which were the contours excluded from the training data sets, were examined. The results showed that the contours extracted using machine learning were closer to the ground truth than the contours traced by other expert and non-expert operators. In addition, using the same learners, the contours were extracted from other naive videos obtained during different speech tasks of the same subject. As a result, the errors in those videos were similar to those in the video in which the learners were trained.