This paper presents a novel method for estimating temporal offsets between unsynchronized multi-view videos. When synchronizing multiple cameras scattered over a large area with wide baselines (e.g., a sports stadium or an event hall), conventional epipolar-based approaches sometimes fail because robust point correspondences are difficult to establish. In such cases, 2D projections of human joints can be robustly associated with each other even in wide-baseline videos and can serve as corresponding points. However, detected 2D poses generally contain detection errors, which cause estimation failures. To address these problems, we introduce the motion rhythm of 2D human joints as a cue for synchronization. The proposed method detects motion rhythms from the videos and estimates the temporal offsets that best harmonize these rhythms. Moreover, we propose a hybrid synchronization algorithm to achieve sub-frame precision. We demonstrate the performance of our method on indoor and outdoor data.
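
As a rough illustration of offset estimation from motion rhythms, the sketch below makes several assumptions that the abstract does not state: each video has already been reduced to a binary per-frame sequence (1 where a joint shows a rhythm event, 0 elsewhere), "best harmonized" is approximated by counting coincident events, and only integer-frame offsets are searched. The function name `estimate_frame_offset` and the parameter `max_offset` are hypothetical, and the sub-frame hybrid step is not covered.

```python
import numpy as np


def estimate_frame_offset(rhythm_a, rhythm_b, max_offset):
    """Return the integer d that maximizes sum_t rhythm_a[t] * rhythm_b[t + d].

    Assumption: rhythm_a and rhythm_b are per-frame event indicators
    (1 = rhythm event, 0 = none). This coincidence count is a stand-in
    for the paper's harmonization score, which the abstract does not define.
    """
    a = np.asarray(rhythm_a, dtype=float)
    b = np.asarray(rhythm_b, dtype=float)
    best_offset, best_score = 0, -np.inf
    for d in range(-max_offset, max_offset + 1):
        # Frames t for which both a[t] and b[t + d] exist.
        t0 = max(0, -d)
        t1 = min(len(a), len(b) - d)
        if t1 <= t0:
            continue  # no overlap at this candidate offset
        score = float(np.dot(a[t0:t1], b[t0 + d:t1 + d]))
        if score > best_score:  # ties keep the earliest candidate
            best_offset, best_score = d, score
    return best_offset


# Toy data: the second camera observes the same events 7 frames later.
events = [5, 12, 20, 31, 45]
cam_a = np.zeros(60)
cam_a[events] = 1
cam_b = np.zeros(60)
cam_b[[e + 7 for e in events]] = 1
offset = estimate_frame_offset(cam_a, cam_b, max_offset=15)
```

With real detections, events from noisy 2D poses would be missed or spurious, so a raw coincidence count is only a starting point; a score that tolerates small timing deviations would likely be needed in practice.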