Video Transformer Network

Neimark, Daniel; Bar, Omri; Zohar, Maya; Asselmann, Dotan

doi:10.1109/iccvw54120.2021.00355

Cited by 347 publications

(112 citation statements)

References 20 publications

Supporting

Mentioning

112

Contrasting


“…and video-text [65,64,87,26,54,1,5], and video-audio [42,53,29] representation learning. While the use of transformer architectures for video is still in its infancy, concurrent works [7,2,51,22] have already demonstrated that this is a highly promising direction. However, these approaches do not have a mechanism for reasoning about motion paths, treating time as just another dimension, unlike our approach.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, its generic nature and its lack of inductive biases also mean that transformers typically require extremely large amounts of data for training [57,8], or aggressive domain-specific augmentations [72]. This is particularly true for video data, for which transformers are also applicable [51], but where statistical inefficiencies are exacerbated. While videos carry rich temporal information, they can also contain redundant spatial information from neighboring frames.…”

Section: Introduction (mentioning)

confidence: 99%

“…Videos are similar, except that 3D points move over time, thus projecting on different parts of the image along certain 2D trajectories. Existing video transformer methods [7,2,51] disregard these trajectories, pooling information over the entire 3D space-time feature volume [2,51], or pooling axially across the temporal dimension [7]. We contend that pooling along motion trajectories would provide a more natural inductive bias for video data, allowing the network to aggregate information from multiple views of the same object or region, to reason about how the object or region is moving (for example, the linear and angular velocities), and to be invariant to camera motion.…”

Section: Introduction (mentioning)

confidence: 99%


Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Patrick, Campbell, Asano et al. 2021

Preprint


In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame t may be entirely unrelated to what is found at that location in frame t + k. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers, trajectory attention, that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. While these ideas are useful in a range of settings, we apply them to the specific task of video action recognition with a transformer model and obtain state-of-the-art results on the Kinetics, Something-Something V2, and Epic-Kitchens datasets. Code and models are available at: https://github.com/facebookresearch/Motionformer.


“…Following the vision transformer (ViT) [13], which demonstrates competitive performance against CNN models on image classification, many recent works attempt to extend the vision transformer for action recognition [36,25,3,1,14]. VTN [36], VidTr [25], TimeSformer [3] and ViViT [1] share the same concept that inserts a temporal modeling module into the existing ViT to enhance the features from the temporal direction. E.g., VTN [36] processes each frame independently and then uses a longformer to aggregate the features across frames.…”

Section: Related Work (mentioning)

confidence: 99%

“…VTN [36], VidTr [25], TimeSformer [3] and ViViT [1] share the same concept that inserts a temporal modeling module into the existing ViT to enhance the features from the temporal direction. E.g., VTN [36] processes each frame independently and then uses a longformer to aggregate the features across frames. On the other hand, divided-space-time modeling in TimeSformer [4] inserts a temporal attention module into each transformer encoder for more fine-grained temporal interaction.…”

Section: Related Work (mentioning)

confidence: 99%
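The two statements above summarize VTN as encoding each frame independently and then aggregating the per-frame features across time. A minimal NumPy sketch of that two-stage structure follows. It is illustrative only: `encode_frame`, `temporal_attention`, and `classify_video` are hypothetical names, the frame encoder is a stand-in (average pooling plus a linear projection) rather than a real 2D backbone, plain full self-attention is swapped in for the Longformer named in the statement, and mean pooling over time is an assumption about how the video-level feature is formed.

```python
import numpy as np

def encode_frame(frame, proj):
    """Hypothetical stand-in for a 2D spatial backbone.

    frame: (H, W, C) array; proj: (C, D) projection matrix.
    Returns one D-dimensional feature vector for the frame.
    """
    return frame.mean(axis=(0, 1)) @ proj

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def temporal_attention(tokens):
    """Single-head full self-attention over time with identity Q/K/V.

    This is NOT Longformer's sliding-window attention; it is the simplest
    attention that mixes information across frames. tokens: (T, D).
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)      # (T, T) frame-to-frame scores
    return softmax(scores, axis=-1) @ tokens     # (T, D) temporally mixed features

def classify_video(frames, proj, head):
    """Encode frames independently, mix across time, then classify.

    frames: (T, H, W, C); proj: (C, D); head: (D, num_classes).
    """
    tokens = np.stack([encode_frame(f, proj) for f in frames])  # (T, D)
    mixed = temporal_attention(tokens)
    video_feature = mixed.mean(axis=0)           # assumed pooling over time
    return video_feature @ head                  # (num_classes,) logits

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8, 3))           # 4 frames of 8x8 RGB
proj = rng.normal(size=(3, 16))
head = rng.normal(size=(16, 5))
logits = classify_video(frames, proj, head)
```

The point of the sketch is the separation of concerns: the spatial stage never sees more than one frame, and all cross-frame interaction happens in the temporal stage.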

Can An Image Classifier Suffice For Action Recognition?

Fan, Chun-Fu, Chen et al. 2021

Preprint


We propose a new perspective on video understanding by casting the video recognition problem as an image recognition task. We show that an image classifier alone can suffice for video understanding without temporal modeling. Our approach is simple and universal. It composes input frames into a super image to train an image classifier to fulfill the task of action recognition, in exactly the same way as classifying an image. We prove the viability of such an idea by demonstrating strong and promising performance on four public datasets including Kinetics400, Something-to-something (V2), MiT and Jester, using a recently developed vision transformer. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. The results on Kinetics400 are comparable to some of the best-performing CNN approaches based on spatio-temporal modeling. Our code and models will be made available at https://github.com/IBM/sifar-pytorch.
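The abstract's idea of composing input frames into a super image can be illustrated with a short NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `make_super_image`, the row-major grid layout, and the zero-padding of unused grid cells are all choices made here for clarity.

```python
import numpy as np

def make_super_image(frames, grid_rows, grid_cols):
    """Tile T frames into one (grid_rows*H, grid_cols*W, C) image.

    frames: (T, H, W, C) array. Frames are placed in row-major order;
    unused grid cells are filled with zeros (an assumption of this sketch).
    """
    t, h, w, c = frames.shape
    n_cells = grid_rows * grid_cols
    if t > n_cells:
        raise ValueError(f"{t} frames do not fit in a {grid_rows}x{grid_cols} grid")

    # Pad the clip with blank frames so it fills the grid exactly.
    padded = np.zeros((n_cells, h, w, c), dtype=frames.dtype)
    padded[:t] = frames

    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = padded.reshape(grid_rows, grid_cols, h, w, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(grid_rows * h, grid_cols * w, c)

# Eight tiny 2x2 single-channel frames, each filled with its own index.
frames = np.stack([np.full((2, 2, 1), i, dtype=np.int64) for i in range(8)])
super_image = make_super_image(frames, grid_rows=3, grid_cols=3)
```

The resulting array can be passed to any image classifier that accepts the enlarged resolution, which is the sense in which the video task is recast as image classification.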


Distillation of human–object interaction contexts for action recognition

Almushyti

2022

Computer Animation & Virtual


Modeling spatial-temporal relations is imperative for recognizing human actions, especially when a human is interacting with objects, while multiple objects appear around the human differently over time. Most existing action recognition models focus on learning overall visual cues of a scene but disregard a holistic view of human-object relationships and interactions, that is, how a human interacts with respect to short-term task for completion and long-term goal. We therefore argue to improve human action recognition by exploiting both the local and global contexts of human-object interactions (HOIs). In this paper, we propose the Global-Local Interaction Distillation Network (GLIDN), learning human and object interactions through space and time via knowledge distillation for holistic HOI understanding. GLIDN encodes humans and objects into graph nodes and learns local and global relations via graph attention network. The local context graphs learn the relation between humans and objects at a frame level by capturing their co-occurrence at a specific time step. The global relation graph is constructed based on the video-level of human and object interactions, identifying their long-term relations throughout a video sequence. We also investigate how knowledge from these graphs can be distilled to their counterparts for improving HOI recognition. Finally, we evaluate our model by conducting comprehensive experiments on two datasets including Charades and CAD-120. Our method outperforms the baselines and counterpart approaches.

