Video-based Human-Object Interaction Detection from Tubelet Tokens

Tu, Danyang; Sun, Wei; Min, Xiongkuo; Zhai, Guangtao; Shen, Wei

doi:10.48550/arxiv.2206.01908

Cited by 1 publication

(1 citation statement)

References 17 publications

“…Tubelet Inputs. As the spatial and temporal dimensions of the tactile signals can be redundant, directly adopting the whole data in classification may result in reduced efficiency. Motivated by previous video transformer models that convert the video clip into tubelets to alleviate the spatiotemporal redundancy, we follow these studies by transferring the tactile signals into a tubelet sequence (Arnab et al. 2021b; Liu et al. 2021; Fan et al. 2021; Tu et al. 2022). We define a tubelet as Q ∈ ℝ^{L×P×P}, where L represents its sequence length (i.e., the number of frames) and P represents the patch size (i.e., height and width).…”

Section: Spatio-temporal Aware Transformer Encoder (mentioning)

confidence: 99%
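
The tubelet definition quoted above (a block of L frames by P × P spatial cells) can be illustrated with a short NumPy sketch. This is not the citing paper's implementation: the function name `to_tubelets`, the (T, H, W) input layout, the non-overlapping tiling, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np


def to_tubelets(signal: np.ndarray, L: int, P: int) -> np.ndarray:
    """Split a (T, H, W) signal into non-overlapping tubelets of shape (L, P, P).

    Hypothetical helper: assumes T, H, W are exact multiples of L, P, P.
    """
    T, H, W = signal.shape
    if T % L or H % P or W % P:
        raise ValueError("T, H and W must be divisible by L, P and P")
    # Expose the block structure: (T/L, L, H/P, P, W/P, P)
    x = signal.reshape(T // L, L, H // P, P, W // P, P)
    # Bring the block indices to the front: (T/L, H/P, W/P, L, P, P)
    x = x.transpose(0, 2, 4, 1, 3, 5)
    # Flatten the block indices into one sequence axis
    return x.reshape(-1, L, P, P)


# Toy input: 8 frames of a 4 x 4 sensor grid (sizes chosen arbitrarily)
signal = np.arange(8 * 4 * 4, dtype=np.float32).reshape(8, 4, 4)
tubelets = to_tubelets(signal, L=4, P=2)
print(tubelets.shape)  # (8, 4, 2, 2)
```

Each entry of `tubelets` is one Q with shape (L, P, P). In a transformer pipeline, each tubelet would typically be flattened and linearly projected into a token, but that step is outside what the quoted statement describes.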

Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification

Lin, Li, Gao et al. 2024

AAAI

Tactile signals collected by wearable electronics are essential in modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which results in suboptimal performance. In this paper, we design the Spatio-Temporal Aware tactility Transformer (STAT) to utilize continuous tactile signals for action classification. We propose spatial and temporal embeddings along with a new temporal pretraining task in our model, which aim to enhance the transformer in modeling the spatio-temporal features of tactile signals. Specifically, the designed temporal pretraining task is to differentiate the time order of tubelet inputs so as to model the temporal properties explicitly. Experimental results on a public action classification dataset demonstrate that our model outperforms state-of-the-art methods in all metrics.
