“…Leveraging the capabilities of video recognition backbones [69], [70], [71], which provide representative features, and adopting the end-to-end learning paradigm [36], which simplifies complex designs, the field has seen significant advancements. In the realm of supervised approaches, the anchor mechanism has seen notable developments, resulting in one-stage methods [33], [39], [72], [73], two-stage methods [14], [36], [52], [74], and anchor-free methods [44], [75], [76], [77]. On the other hand, in the context of weakly supervised methods, the community has introduced the pre-classification pipeline [2], [78], [79], [80] and the post-classification pipeline [20], [54], [81], [82].…”