“…Despite the gratifying success of these approaches, most two-stage pipelines for video object detection are overly sophisticated, requiring many hand-crafted components, e.g., optical flow models [26], [27], [28], [29], [30], recurrent neural networks [23], [25], [31], deformable convolution fusion [21], [32], [33], and relation networks [23], [34], [35]. In addition, most of them need complicated post-processing methods that link the same object across the video to form tubelets and aggregate classification scores within the tubelets to achieve state-of-the-art performance [12], [13], [14], [15]. Meanwhile, there are also several related studies [16], [17], [33], [36], [37], [38] focusing on real-time video object detection.…”