You watch once more: a more effective CNN architecture for video spatio-temporal action localization

Qin, Yefeng; Chen, Lei; Ben, Xianye; Yang, Mingqiang

doi:10.1007/s00530-023-01254-z

Multimedia Systems

2024


Abstract: The task of spatio-temporal action localization (STAL) needs to detect the action and position of individuals in the scene. Many works focus on how to improve the accuracy, but they usually ignore inference speed and practical applications. To address the above problems, we propose a new end-to-end spatio-temporal action localization network called You Watch Once More (YWOM). In this work, there are three measures proposed to improve the accuracy of positioning and recognition while guaranteeing the inference …


Cited by 2 publications

References 57 publications


Online spatio-temporal action detection with adaptive sampling and hierarchical modulation

Su, Gan

2024

Multimedia Systems


Real-time spatiotemporal action localization algorithm using improved CNNs architecture

Liu, Li, Tong et al.

2024

Sci Rep


This paper aims to propose a faster and more accurate network for human spatiotemporal action localization tasks. Like the YOWO model, we also use convolutional neural networks (CNNs) for feature extraction, but our model differs from YOWO in three significant ways: firstly, we don’t use the feature fusion strategy, we only use spatial features extracted by 2D CNNs for action localization and spatiotemporal features extracted by 3D CNNs for action recognition; secondly, we make an improvement to the 2D CNNs network by introducing a coordinate attention mechanism and utilize the CIoU loss instead of the coordinate offset loss for bounding box regression; thirdly, we provide a more lightweight and faster spatiotemporal action localization architecture, which reduces the number of parameters by 21.76 million and achieves a speed of 39 fps on 16-frame input clips compared to the YOWO model. We test our model’s performance on three public datasets: UCF-Sports, JHMDB-21 and UCF101-24. Compared with the YOWO model, we improve frame-mAP (@IoU 0.5) by 17.09% and 7.15% on the UCF-Sports and JHMDB-21 datasets, and for video-mAP, we improve by 2.7%, 8.7% and 14.4% at IoU thresholds of 0.2, 0.5 and 0.75 on the JHMDB-21 dataset.
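The abstract above mentions replacing a coordinate offset loss with the CIoU loss for bounding box regression and reporting frame-mAP at an IoU threshold of 0.5. As a minimal sketch of those two quantities, the following uses the commonly published Complete-IoU formulation (IoU term, normalized center distance, and aspect-ratio penalty). The function names `iou` and `ciou_loss`, the `(x1, y1, x2, y2)` box layout, and the `eps` stabilizer are assumptions made for illustration; none of this is taken from the cited paper's implementation.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, target, eps=1e-9):
    """Complete-IoU loss: 1 - IoU + center-distance term + aspect-ratio term."""
    overlap = iou(pred, target)

    # Squared distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tcx, tcy = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    tw, th = target[2] - target[0], target[3] - target[1]
    v = (4 / math.pi ** 2) * (
        math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / ((1 - overlap) + v + eps)

    return 1 - overlap + rho2 / c2 + alpha * v


# Hypothetical boxes: a detection counts as a match at the 0.5 threshold
# only when its IoU with the ground-truth box is at least 0.5.
ground_truth = (0.0, 0.0, 2.0, 2.0)
prediction = (1.0, 1.0, 3.0, 3.0)
is_match = iou(prediction, ground_truth) >= 0.5
```

Unlike a plain IoU loss, the center-distance term still varies when the boxes do not overlap at all, which is the usual motivation given for CIoU-style losses in box regression.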
