IFOR: Iterative Flow Minimization for Robotic Object Rearrangement

Goyal, Ankit; Mousavian, Arsalan; Paxton, Chris; Chao, Yu-Wei; Okorn, Brian; Jia, Dechang; Fox, Dieter

doi:10.48550/arxiv.2202.00732

Cited by 2 publications

(6 citation statements)

References 50 publications


“…The use of depth input has also been extensively studied. Methods like CLIPort [3] and IFOR [1] directly process the RGB-D images for object manipulation, and hence are limited to simple pick-and-place tasks in 2D top-down settings. To overcome this issue, explicit 3D representations such as point clouds have been utilized.…”

Section: Related Work (mentioning)

confidence: 99%

“…Learning a single model for many different tasks has been of particular interest to the robotics community recently. A large volume of work achieves the multi-task generalization by using a generalizable task or action representation such as object point cloud [18,19], semantic segmentation and optical flow [1], and object-centric representation [29,30]. However, the limited expressiveness of such representations constrains them to only generalize within a task category.…”

Section: Related Work (mentioning)

confidence: 99%

“…More visualizations of the task setups and the model performance are also provided. 1 Ablation Study. We conduct ablation experiments to analyze different design choices of RVT: (a) the resolution of the rendered images ("Im.…”

Section: Simulation Experiments (mentioning)

confidence: 99%

“…A popular class of learning methods directly processes image(s) viewed from single or multiple cameras. These view-based methods have achieved impressive success on a variety of pick-and-place and object rearrangement tasks [1,2,3,4]. However, their success on tasks that require 3D reasoning has been limited.…”

Section: Introduction (mentioning)

confidence: 99%


IFOR: Iterative Flow Minimization for Robotic Object Rearrangement

Goyal, Mousavian, Paxton, et al. 2022

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


For 3D object manipulation, methods that build an explicit 3D representation perform better than those relying only on camera images. But using explicit 3D representations like voxels comes at large computing cost, adversely affecting scalability. In this work, we propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate. Some key features of RVT are an attention mechanism to aggregate information across views and re-rendering of the camera input from virtual views around the robot workspace. In simulations, we find that a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct). It also trains 36X faster than PerAct for achieving the same performance and achieves 2.3X the inference speed of PerAct. Further, RVT can perform a variety of manipulation tasks in the real world with just a few (∼10) demonstrations per task. Visual results, code, and trained model are provided at: https://robotic-view-transformer.github.io/.




“…In our approach, we use unknown object instance segmentation to break our scene up into objects, as per prior work [6], [7], [8], [9]. Then, we use a multi-modal transformer to combine both word tokens and object encodings from Point Cloud Transformer [10] in order to make 6-DoF goal pose predictions.…”

Section: Introduction (mentioning)

confidence: 99%