Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence 2021
DOI: 10.24963/ijcai.2021/104

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

Abstract: This paper proposes a novel pretext task for self-supervised video representation learning by exploiting spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporal by nature and a representation learned by detecting spatiotemporal continuity/discontinuity is thus beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, as we demonstrate in the experime…
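To make the "spatiotemporal (3D) jigsaw puzzle" mentioned in the abstract concrete, here is a minimal sketch of how such a puzzle could be built from a video clip. It shows only the plain, unconstrained version; it is not the paper's constrained variant, whose details are not given in the truncated abstract. The function name `make_3d_jigsaw`, the `(T, H, W, C)` clip layout, and the 2x2x2 grid are assumptions chosen for illustration.

```python
import numpy as np


def make_3d_jigsaw(clip, grid=(2, 2, 2), rng=None):
    """Cut a clip of shape (T, H, W, C) into grid cells and shuffle them.

    Hypothetical helper for illustration only. Returns the shuffled clip
    and the permutation, where perm[i] is the index of the source cell
    placed at position i (cells are numbered time-major, then row, then
    column).
    """
    rng = np.random.default_rng() if rng is None else rng
    gt, gh, gw = grid
    T, H, W, C = clip.shape
    if T % gt or H % gh or W % gw:
        raise ValueError("clip dimensions must be divisible by the grid")
    ct, ch, cw = T // gt, H // gh, W // gw

    # Split each of T, H, W into (grid index, offset inside the cell),
    # then bring the three grid indices to the front and flatten them.
    cells = clip.reshape(gt, ct, gh, ch, gw, cw, C)
    cells = cells.transpose(0, 2, 4, 1, 3, 5, 6)
    cells = cells.reshape(gt * gh * gw, ct, ch, cw, C)

    # Shuffle the cells; the permutation is the self-supervised target.
    perm = rng.permutation(gt * gh * gw)
    shuffled = cells[perm]

    # Undo the cell layout to obtain a clip-shaped array again.
    out = shuffled.reshape(gt, gh, gw, ct, ch, cw, C)
    out = out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)
    return out, perm


if __name__ == "__main__":
    clip = np.zeros((8, 4, 4, 3), dtype=np.float32)  # dummy 8-frame clip
    puzzle, perm = make_3d_jigsaw(clip, rng=np.random.default_rng(0))
    print(puzzle.shape, perm)
```

In this sketch a model would receive the shuffled clip and be trained to recover `perm`. With a 2x2x2 grid there are already 8! = 40,320 possible orderings, so treating every permutation as its own class grows factorially with the number of cells.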

Citations: cited by 18 publications (3 citation statements)
References: 15 publications
“…While modeling time remains a challenge, it also presents a natural source of supervision that has been exploited for self-supervised learning. For example, as a proxy signal by posing pretext tasks involving spatio-temporal jigsaw [1,43,52], video speed [10,16,47,94,109,123], arrow of time [78,80,112], frame/clip ordering [24,70,90,97,116], video continuity [60], or tracking [44,106,111]. Several works have also used contrastive learning to obtain spatio-temporal representations by (i) contrasting temporally augmented versions of a clip [46,77,81], or (ii) encouraging consistency between local and global temporal contexts [9,17,85,122].…”
Section: Time In Vision
Citation type: mentioning
confidence: 99%
“…Modelling the temporal dynamics is essential for a genuine understanding of videos. Hence, it is widely explored in both supervised [20,35,48,49,63,70] and self-supervised paradigm [28,29,34,36,39]. Self-supervised approaches learns temporal modelling by solving various pre-text tasks, such as dense future prediction [28,29], jigsaw puzzle solving [36,39], and pseudo motion classification [34], etc.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Hence, it is widely explored in both supervised [20,35,48,49,63,70] and self-supervised paradigm [28,29,34,36,39]. Self-supervised approaches learns temporal modelling by solving various pre-text tasks, such as dense future prediction [28,29], jigsaw puzzle solving [36,39], and pseudo motion classification [34], etc. Supervised video recognition explores various connections between different frames, such as 3D convolutions [62], temporal convolution [63], and temporal shift [48], etc.…”
Section: Related Work
Citation type: mentioning
confidence: 99%