Video transformers achieve superior performance in video recognition; despite recent advances, however, they still require substantial computation and memory resources. To improve computational efficiency, a kernel-based video transformer is proposed, which comprises: (1) a new formulation of the video transformer via kernel learning, which provides a better understanding of its individual components; (2) a lightweight kernel-based spatial–temporal multi-head self-attention block that learns compact joint spatial–temporal video features; (3) an adaptive-score position embedding method that increases the flexibility of the video transformer. Experimental results on several action recognition datasets demonstrate the effectiveness of the proposed method. Pretrained only on ImageNet-1K, the method achieves a favourable balance between computation and accuracy, requiring 7× fewer parameters and 13× fewer floating point operations than comparable methods.
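The abstract does not detail the kernel-based attention block itself; for orientation only, the following is a minimal sketch of generic kernelized (linear) self-attention over flattened spatial–temporal video tokens, assuming a PyTorch setting and the common elu(x)+1 feature map. It illustrates why kernelization reduces computation (attention becomes linear in the token count) and is not the paper's specific formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KernelSelfAttention(nn.Module):
    """Generic kernelized (linear) self-attention over flattened
    spatial-temporal video tokens of shape (batch, T*H*W, dim).
    Illustrative sketch: the feature map phi(x) = elu(x) + 1 and the
    single-head layout are assumptions, not the paper's design."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    @staticmethod
    def phi(x: torch.Tensor) -> torch.Tensor:
        # Positive feature map commonly used to approximate the softmax kernel.
        return F.elu(x) + 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k = self.phi(q), self.phi(k)
        # Associativity: (phi(Q) phi(K)^T) V is computed as phi(Q) (phi(K)^T V),
        # so the cost grows linearly with the number of tokens instead of quadratically.
        kv = torch.einsum("bnd,bne->bde", k, v)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        return self.proj(out)


# Usage example with hypothetical sizes: 8 frames of 14x14 patch tokens, 256-dim.
tokens = torch.randn(2, 8 * 14 * 14, 256)
attn = KernelSelfAttention(dim=256)
print(attn(tokens).shape)  # torch.Size([2, 1568, 256])
```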