Semi-supervised video object segmentation (semi-VOS), which tracks class-agnostic objects from a given segmentation mask, is required by many visual applications. Various approaches have achieved high accuracy in this field, but these models are difficult to deploy in real-world applications due to their slow inference and tremendous complexity. To speed up inference significantly while narrowing the performance gap to those models, we introduce a fast segmentation model based on a template matching method and an auxiliary loss with a transition map. Our template matching consists of short-term and long-term matching: short-term matching enhances target-object localization by focusing on neighboring frames, while long-term matching improves fine details and handles shape changes of the object by considering long-range frames. However, since both matching processes generate their templates from previously estimated masks, errors propagate when tracking objects in subsequent frames. To mitigate this problem, we add an auxiliary loss with a newly proposed transition map that encourages the model to correct itself and produce accurate masks of the target object. Our model obtains an 81.1% J&F score at 78.3 FPS on the DAVIS16 benchmark and is 1.4× faster and 11.3% more accurate than SiamMask, one of the fast semi-VOS models.

INDEX TERMS Semi-supervised video object segmentation, video object segmentation, video object tracking, deep learning.
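To make the template matching idea concrete, the following is a minimal sketch of mask-based foreground/background template matching as commonly used in matching-based semi-VOS. It is an illustrative example, not the paper's implementation: the feature shapes, the average-pooled templates, and the cosine-similarity softmax score are assumptions chosen for clarity.

```python
import numpy as np

def build_template(feats, mask):
    """Average-pool features inside/outside an estimated mask.

    feats: (C, H, W) feature map of a reference frame.
    mask:  (H, W) soft mask in [0, 1] predicted (or given) for that frame.
    Returns foreground and background template vectors of shape (C,).
    """
    c = feats.shape[0]
    f = feats.reshape(c, -1)           # (C, HW)
    m = mask.reshape(1, -1)            # (1, HW)
    fg = (f * m).sum(axis=1) / (m.sum() + 1e-6)
    bg = (f * (1 - m)).sum(axis=1) / ((1 - m).sum() + 1e-6)
    return fg, bg

def match(feats, fg, bg):
    """Score each pixel against the fg/bg templates by cosine similarity.

    Returns an (H, W) foreground probability map via a softmax over the
    two similarities.
    """
    c, h, w = feats.shape
    f = feats.reshape(c, -1)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-6)
    fg_n = fg / (np.linalg.norm(fg) + 1e-6)
    bg_n = bg / (np.linalg.norm(bg) + 1e-6)
    s_fg, s_bg = fg_n @ f, bg_n @ f    # (HW,) similarities
    score = np.exp(s_fg) / (np.exp(s_fg) + np.exp(s_bg))
    return score.reshape(h, w)

# In this framing, short-term matching would build its template from the
# previous frame's estimated mask (good localization, but errors propagate),
# while long-term matching would build it from the first frame's given
# ground-truth mask (robust to drift, handles long-range appearance change).
```

Because the short-term template depends on the previous prediction, a wrong mask corrupts the next template, which is exactly the error-propagation problem the proposed transition-map auxiliary loss is meant to counteract.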