2022
DOI: 10.48550/arxiv.2206.01908
Preprint
Video-based Human-Object Interaction Detection from Tubelet Tokens

Abstract: We present a novel vision Transformer, named TUTOR, which learns tubelet tokens, serving as highly abstracted spatiotemporal representations, for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically related patch tokens along the spatial and temporal domains, which brings two benefits: 1) Compactness: each tubelet token is learned by a selective attention mechanism to reduce redundant spatial dependencies from others; 2) Exp…

Cited by 1 publication (1 citation statement) · References 17 publications
“…Tubelet Inputs: As the spatial and temporal dimensions of the tactile signals can be redundant, directly adopting the whole data in classification may result in reduced efficiency. Motivated by previous video transformer models that convert the video clip into tubelets to alleviate the spatiotemporal redundancy, we follow these studies by transforming the tactile signals into a tubelet sequence (Arnab et al. 2021b; Liu et al. 2021; Fan et al. 2021; Tu et al. 2022). We define a tubelet as Q ∈ R^{L×P×P}, where L represents its sequence length (i.e., the number of frames) and P represents the patch size (i.e., height and width).…”
Section: Spatio-temporal Aware Transformer Encoder
confidence: 99%
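The tubelet construction quoted above can be sketched as a simple tensor reshape: a (T, H, W) signal is split into non-overlapping tubelets of shape (L, P, P). This is a minimal illustrative sketch, not code from the cited papers; the function name and the divisibility assumptions (T divisible by L, H and W divisible by P) are assumptions made here for simplicity.

```python
import numpy as np

def to_tubelets(signal: np.ndarray, L: int, P: int) -> np.ndarray:
    """Split a (T, H, W) signal into non-overlapping tubelets of shape (L, P, P).

    Assumes T % L == 0 and H % P == 0 and W % P == 0.
    Illustrative only; not taken from TUTOR or the citing paper.
    """
    T, H, W = signal.shape
    assert T % L == 0 and H % P == 0 and W % P == 0
    # Factor each axis into (number of chunks, chunk size).
    x = signal.reshape(T // L, L, H // P, P, W // P, P)
    # Bring the grid axes to the front: (T//L, H//P, W//P, L, P, P).
    x = x.transpose(0, 2, 4, 1, 3, 5)
    # Flatten the grid into a sequence of tubelets.
    return x.reshape(-1, L, P, P)

# Example: 8 frames of 16x16 signals, tubelets of 4 frames and 8x8 patches
# yield (8/4) * (16/8) * (16/8) = 8 tubelets of shape (4, 8, 8).
tubes = to_tubelets(np.zeros((8, 16, 16)), L=4, P=8)
```

Each tubelet then plays the role of one token in the transformer's input sequence, so redundancy is reduced before attention is applied rather than inside it.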