VideoDG: Generalizing Temporal Relations in Videos to Novel Domains

Yao, Zhiyu; Wang, Yunbo; Wang, Jianmin; Yu, Philip S.; Long, Mingsheng

doi:10.1109/tpami.2021.3116945

Cited by 20 publications

(10 citation statements)

References 34 publications

“…Several techniques have been introduced to solve this problem with deep models (Muandet et al, 2013; Li et al, 2017, 2018a; Motiian et al, 2017), and with important results for a variety of datasets and data types, but the area is significantly under-explored with respect to video datasets, due to the complexity of entangling spatial and temporal domain shifts. In Yao et al (2019, 2021), the only recent prominent work in this area, the authors present the Adversarial Pyramid Network (APN), a network capturing the videos' local-, global-, and multi-layer cross-relation features. They also extend an adversarial data augmentation method in Volpi et al (2018), ADA, to videos.…”

Section: Video Domain Generalization (mentioning)

confidence: 99%

Ani-GIFs: A benchmark dataset for domain generalization of action recognition from GIFs

Majumdar

Jain

Tourni

et al. 2022

Front. Comput. Sci.

Deep learning models perform remarkably well for the same task under the assumption that data is always coming from the same distribution. However, this is generally violated in practice, mainly due to the differences in data acquisition techniques and the lack of information about the underlying source of new data. Domain generalization targets the ability to generalize to test data of an unseen domain; while this problem is well-studied for images, such studies are significantly lacking in spatiotemporal visual content—videos and GIFs. This is due to (1) the challenging nature of misalignment of temporal features and the varying appearance/motion of actors and actions in different domains, and (2) spatiotemporal datasets being laborious to collect and annotate for multiple domains. We collect and present the first synthetic video dataset of Animated GIFs for domain generalization, Ani-GIFs, that is used to study the domain gap of videos vs. GIFs, and animated vs. real GIFs, for the task of action recognition. We provide a training and testing setting for Ani-GIFs, and extend two domain generalization baseline approaches, based on data augmentation and explainability, to the spatiotemporal domain to catalyze research in this direction.

“…Domain shift in action recognition. In [6,52], cross-domain datasets are introduced to study methods for video domain adaptation. Chen et al [6] propose to align temporal and spatial features across the domains, whereas Yao et al [52] propose to improve the generalizability of so-called local features instead of global features, and use a novel augmentation scheme.…”

Section: Related Work (mentioning)

confidence: 99%

“…In [6,52], cross-domain datasets are introduced to study methods for video domain adaptation. Chen et al [6] propose to align temporal and spatial features across the domains, whereas Yao et al [52] propose to improve the generalizability of so-called local features instead of global features, and use a novel augmentation scheme. Strikingly, however, all experiments in [6,52] are based on features extracted frame-by-frame, by a ResNet [21], and aggregated after-the-fact, which means that they in effect do not handle spatiotemporal features.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, the space in which these types of models can be compared is vast, and there likely are important modes of comparisons that we have left out. Another limitation of our work is that we have not run more cross-domain experiments for other datasets, such as the UCF-HMDB full or the Something-Something Cross-Relation benchmark [52] (the Kinetics-Gameplay [6] dataset is not made public except for frame-wise features extracted from a ResNet101). Typically, however, cross-domain benchmarks are small and require pre-training, which thus will be more suited for a continued set of experiments including pre-trained models.…”

Section: D Cnn Convlstm (mentioning)

confidence: 99%

“…This is important, since there lies information in the inter-dependency of the frames, which might be lost when down-sampling each frame of the sequence before the temporal processing. Despite this positive development, video models lack robustness to domain shift [52,53]. It has been repeatedly shown [7,51] that the action recognition datasets which were most frequently cited during the 2010s (UCF-101 [43], HMDB [28], Kinetics [24], AVA [20], and YouTube8M [1]) exhibit significant spatial biases.…”

Section: Introduction (mentioning)

confidence: 99%

Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition

Broomé

Pokropek

Li

2021

Preprint

Most action recognition models today are highly parameterized, and evaluated on datasets with predominantly spatially distinct classes. Previous results for single images have shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape for various computer vision tasks [16], reducing generalization. Taken together, this raises suspicion that large video models learn spurious correlations rather than to track relevant shapes over time and infer generalizable semantics from their movement. A natural way to avoid parameter explosion when learning visual patterns over time is to make use of recurrence across the time-axis. In this article, we empirically study the cross-domain robustness for recurrent, attention-based and convolutional video models, respectively, to investigate whether this robustness is influenced by the frame dependency modeling. Our novel Temporal Shape dataset is proposed as a light-weight dataset to assess the ability to generalize across temporal shapes which are not revealed from single frames. We find that when controlling for performance and layer structure, recurrent models show better out-of-domain generalization ability on the Temporal Shape dataset than convolution- and attention-based models. Moreover, our experiments indicate that convolution- and attention-based models exhibit more texture bias on Diving48 than recurrent models.

Test-Time Adaptation for Egocentric Action Recognition

Plizzari

Caputo

2022

Lecture Notes in Computer Science
