2022
DOI: 10.48550/arxiv.2203.06173
Preprint

Masked Visual Pre-training for Motor Control

Abstract: This paper shows that self-supervised visual pretraining from real-world images is effective for learning motor control tasks from pixels. We first train the visual representations by masked modeling of natural images. We then freeze the visual encoder and train neural network controllers on top with reinforcement learning. We do not perform any task-specific fine-tuning of the encoder; the same visual representations are used for all motor control tasks. To the best of our knowledge, this is the first self-su…
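The abstract describes a two-stage pipeline: pre-train a visual encoder by masked modeling, then freeze it and train only a small controller on top with reinforcement learning. Below is a minimal PyTorch sketch of the frozen-encoder policy; the class name, controller sizes, and the stand-in encoder are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class FrozenEncoderPolicy(nn.Module):
    """Policy that reads pixels through a frozen, pre-trained visual encoder.

    Only the small controller head is updated by RL; the encoder weights
    stay fixed and are shared across all motor control tasks.
    """

    def __init__(self, encoder: nn.Module, feat_dim: int, act_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze: no gradient updates
            p.requires_grad = False
        self.controller = nn.Sequential(      # trainable controller head
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # encoder runs inference-only
            feats = self.encoder(pixels)
        return self.controller(feats)

# Hypothetical stand-in: any masked-modeling pre-trained ViT that maps
# images to a flat feature vector would slot in here.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
policy = FrozenEncoderPolicy(encoder, feat_dim=768, act_dim=7)
actions = policy(torch.randn(4, 3, 224, 224))  # batch of 4 RGB frames
print(actions.shape)  # torch.Size([4, 7])
```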

Cited by 20 publications (48 citation statements) | References 40 publications
“…Recent approaches have also demonstrated that simple data augmentations can sometimes be effective even without such representation learning objectives [80,81,82]. The work closest to ours is Xiao et al. [83], which demonstrated that frozen representations from MAE pre-trained on real-world videos can be used for training RL agents on visual manipulation tasks. Our work differs in that we demonstrate that training a self-supervised ViT with reconstruction and convolutional feature masking can be more effective for visual control tasks when compared to an MAE that masks pixel patches.…”
Section: Discussion
confidence: 91%
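The excerpt above contrasts MAE-style masking of raw pixel patches with masking applied to convolutional feature maps. A toy sketch of the two masking targets follows; the function names, mask ratio, and tensor shapes are illustrative assumptions, not code from either paper.

```python
import torch

def mask_pixel_patches(images, patch=16, ratio=0.75):
    """MAE-style masking: zero out random non-overlapping pixel patches."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, gh * gw) > ratio          # True = patch survives
    mask = keep.view(b, 1, gh, gw).float()
    mask = mask.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return images * mask                           # broadcast over channels

def mask_conv_features(feats, ratio=0.75):
    """Feature-level masking: drop random spatial cells of a conv feature map."""
    b, c, h, w = feats.shape
    keep = (torch.rand(b, 1, h, w) > ratio).float()
    return feats * keep

imgs = torch.randn(2, 3, 224, 224)
feats = torch.randn(2, 256, 14, 14)                # e.g. output of a conv stem
print(mask_pixel_patches(imgs).shape, mask_conv_features(feats).shape)
```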
“…Our work differs in that we demonstrate that training a self-supervised ViT with reconstruction and convolutional feature masking can be more effective for visual control tasks when compared to an MAE that masks pixel patches. We also note that our work is orthogonal to Xiao et al. [83] in that our framework can also initialize the autoencoder with pre-trained parameters.…”
Section: Discussion
confidence: 99%
“…Pre-training representations for RL. One common way pre-training has been used in RL is to improve the representations that are used in the policy network. Self-supervised representation learning techniques distill knowledge from external datasets to produce downstream features that are helpful in virtual environments [Du et al., 2021, Xiao et al., 2022]. Some recent work shows benefits from pre-training on more general, large-scale datasets.…”
Section: Related Work
confidence: 99%
“…In this paper, we study allocation policies for multiple humans and multiple robots. While many existing works [58,59,60,61,62] have leveraged the parallel-simulation capability of NVIDIA's Isaac Gym [11] to accelerate reinforcement learning with multiple robots, these approaches consider robots in isolation, without interactive human supervision. The work closest to ours is by Swamy et al. [63], who study the multi-robot, single-human setting of allocating the attention of one human operator during robot fleet learning.…”
Section: Multi-robot Interactive Learning
confidence: 99%
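The excerpt above mentions accelerating reinforcement learning by stepping many simulated robots in parallel. A minimal sketch of that pattern follows, using Gymnasium's vectorized environments as a stand-in (Isaac Gym's actual GPU-based API differs; the environment choice and instance count here are illustrative).

```python
import gymnasium as gym

# Many parallel copies of one environment, stepped in lock-step with
# batched actions -- the same pattern GPU simulators like Isaac Gym
# scale to thousands of instances.
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("Pendulum-v1") for _ in range(8)]
)
obs, info = envs.reset(seed=0)
for _ in range(100):
    actions = envs.action_space.sample()          # batched random actions
    obs, rewards, terminated, truncated, info = envs.step(actions)
envs.close()
```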