Spatio-temporal action detection requires a model to output the temporal and spatial positions and the action category of each target action instance in the form of action tubes. However, the current definitions of video-level metrics for this task are neither clear nor unified enough to fully describe a network model's spatio-temporal detection ability. Furthermore, existing tube linking methods depend heavily on the quality of the detection stage and lack reliable linking criteria, which results in poor tube linking performance. To address these issues, this study proposes a hierarchical linking method based on multiple clues, abbreviated as MCHL. The method first dynamically uses various correlation clues at two levels, including appearance features, spatial overlap, motion prediction, category scores, tube length, and tube confidence status, to reduce the negative impact of unreliable information on correlation. It then employs inter-class correlation to handle the mutual influence between different categories, and uses joint probability data allocation to address the mutual influence between correlated objects, ultimately achieving robust and accurate online linking of action tubes. The method is compared experimentally with other correlation methods on the untrimmed UCF24 and MultiSports datasets and demonstrates state-of-the-art tube linking performance. We also conduct ablation experiments to explore the impact of the different modules and stages of the proposed tube linking method.

INDEX TERMS MCHL, spatio-temporal action detection, linking method, untrimmed video mAP.