2022
DOI: 10.48550/arxiv.2205.04749
Preprint
Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild

Abstract: Previous methods for dynamic facial expression recognition in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos. To solve this problem, we propose the spatio-temporal Transformer (STT) to capture discriminative features within each frame and model contextual relationships among frames. Spatio-temporal dependencies are captured and integrated by our unified Transformer. Specifically, given an image sequence consisting of multiple frames as…
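The abstract describes a unified Transformer that attends jointly over the spatial and temporal positions of a clip. As a rough illustration of that idea, here is a minimal PyTorch sketch of joint spatio-temporal self-attention over flattened frame-patch tokens; the module names, dimensions, pooling, and classification head are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of joint spatio-temporal self-attention for DFER.
# Hypothetical names and sizes; not the STT authors' code.
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, depth=4, num_classes=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # 7 expression classes (illustrative)

    def forward(self, tokens):
        # tokens: (batch, T*S, dim) -- per-frame patch features flattened so
        # self-attention spans all spatial AND temporal positions at once.
        encoded = self.encoder(tokens)    # unified attention over T*S tokens
        clip_feat = encoded.mean(dim=1)   # average-pool tokens into a clip feature
        return self.head(clip_feat)      # expression logits

# Example: 8 frames x 49 patch tokens per frame, 512-dim features.
x = torch.randn(2, 8 * 49, 512)
logits = SpatioTemporalEncoder()(x)  # -> shape (2, 7)
```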

Cited by 3 publications (8 citation statements). References 36 publications.
“…2. We provide a strong baseline method by utilizing the vanilla spatial and temporal attention in [11]. The results in Tab.…”
Section: Methods (mentioning)
confidence: 99%
“…For example, Li et al. [9] exploit bidirectional Transformers to capture the temporal information among frames. Directly extending the vanilla Transformer [10,11] to DFER requires performing multi-head self-attention jointly across all S spatial locations and T temporal locations. Namely, the full space-time attention, with complexity O(T²S²), places a heavy computational burden on the vanilla Transformer framework for efficient dynamic facial expression recognition.…”
Section: Introduction (mentioning)
confidence: 99%
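To make the complexity comparison in the statement above concrete, here is a small back-of-the-envelope sketch counting attention token pairs for joint space-time attention versus a common spatial/temporal factorization; the frame and patch counts are arbitrary example values, not figures from the cited papers.

```python
# Joint space-time attention scores all (T*S)^2 token pairs, i.e. O(T^2 S^2);
# factorizing into per-frame spatial and per-location temporal attention
# scores T*S^2 + S*T^2 pairs instead. Illustrative arithmetic only.
T, S = 16, 196  # e.g., 16 frames, 14x14 = 196 patch tokens per frame

joint = (T * S) ** 2               # full space-time attention
factorized = T * S**2 + S * T**2   # spatial-then-temporal attention

print(f"joint:      {joint:,} pairs")       # 9,834,496
print(f"factorized: {factorized:,} pairs")  # 664,832
print(f"ratio:      {joint / factorized:.1f}x")  # ~14.8x
```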
“…Specifically, Zhao et al. devised a dynamic facial expression recognition transformer (Former-DFER) consisting of CS-Former and T-Former for learning spatial and temporal features, respectively. Ma et al. proposed a spatio-temporal transformer (STT), which captures spatial and temporal information jointly with a transformer-based encoder (Ma, Sun, and Li 2022). Additionally, Li et al. introduced NR-DFERNet to suppress the impact of noisy frames in video sequences (Li et al. 2022).…”
Section: Related Work: DFER in the Wild (mentioning)
confidence: 99%
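The factorized design in the statement above (a spatial encoder such as CS-Former followed by a temporal encoder such as T-Former) can be sketched roughly as follows; the module structure, depths, and pooling choices are assumptions for illustration, not the released Former-DFER or STT code.

```python
# Sketch of factorized spatial-then-temporal attention: attend within each
# frame, pool, then attend across frames. Hypothetical names and sizes.
import torch
import torch.nn as nn

def make_encoder(dim, heads, depth):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class FactorizedSTEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, num_classes=7):
        super().__init__()
        self.spatial = make_encoder(dim, heads, depth=2)   # within-frame attention
        self.temporal = make_encoder(dim, heads, depth=2)  # across-frame attention
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (batch, T, S, dim) patch tokens per frame
        B, T, S, D = x.shape
        x = self.spatial(x.reshape(B * T, S, D))      # attend over S tokens per frame
        frame_feat = x.mean(dim=1).reshape(B, T, D)   # pool each frame to one token
        clip = self.temporal(frame_feat).mean(dim=1)  # attend over T frame tokens
        return self.head(clip)

logits = FactorizedSTEncoder()(torch.randn(2, 8, 49, 512))  # -> (2, 7)
```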
“…In recent years, with the development of parallel computing hardware and the collection of large-scale DFER datasets (Wang et al. 2022; Jiang et al. 2020), deep learning-based methods have gradually replaced algorithms based on hand-crafted features and achieved state-of-the-art performance on in-the-wild DFER datasets. For instance, the vision transformer (ViT) (Dosovitskiy et al. 2020) has obtained promising results on many computer vision tasks, which has inspired many researchers to build DFER models based on ViT (Ma, Sun, and Li 2022). Since the transformer has strong robustness against severe occlusion and disturbance (Naseer et al. 2021), these transformer-based approaches mostly deal with various interferences in practical scenarios (e.g., variant head poses, poor illumination, and occlusions) by utilizing both a spatial transformer and a temporal transformer.…”
Section: Introduction (mentioning)
confidence: 99%