Human Hands as Probes for Interactive Object Understanding

Goyal, Mohit; Modi, Sahil; Goyal, Rohit; Gupta, Saurabh

doi:10.1109/cvpr52688.2022.00329

Cited by 21 publications

(6 citation statements)

References 48 publications

“…Some other approaches ground these action labels to images by predicting heatmaps that indicate interaction possibilities [14,35,48,57,60]. While heatmaps only specify where to interact without telling what to do, recent approaches predict richer properties such as contact distance [39], action trajectory [48,55], grasping categories [23,50], etc. Instead of predicting more sophisticated interaction states, we explore directly synthesizing HOI images for possible interactions because images demonstrate both where and how to interact comprehensively and in a straightforward manner.…”

Section: Related Work (mentioning)

confidence: 99%

Affordance Diffusion: Synthesizing Hand-Object Interactions

Ye, Li, Gupta et al. 2023

Preprint

“…Recently, the computer vision community is increasingly interested in understanding 3D dynamics of objects. Researchers try to understand the 3D shapes, axes, movable parts and affordance on synthetic data [42,64,43,25,62,32,60], videos [47,21,20,44] or point clouds [26]. Our work is mostly related to [47,21,20] since they work on real images, but is different from them on two aspectives: First, they need video or multi-view inputs, but our input is only a single image.…”

Section: Related Work (mentioning)

confidence: 99%

“…Researchers try to understand the 3D shapes, axes, movable parts and affordance on synthetic data [42,64,43,25,62,32,60], videos [47,21,20,44] or point clouds [26]. Our work is mostly related to [47,21,20] since they work on real images, but is different from them on two aspectives: First, they need video or multi-view inputs, but our input is only a single image. Second, their approaches recover the objects which are being interacted, while our approach understands potential interactions before any interactions happen.…”

Section: Related Work (mentioning)

confidence: 99%

Understanding 3D Object Articulation in Internet Videos

Qian, Jin, Rockwell et al. 2022

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


Figure 1 (panels: image with query points; (a) localization and properties; (b) affordance and action; (c) potential interaction). Given a single image and a set of query points, our approach predicts: (a) whether the object at the location can be moved, its rigidity and articulation class, and location; (b) an affordance and action; and (c) potential 3D interaction for articulated objects. This ability can assist intelligent agents to better manipulate objects or explore the 3D scene.


“…Articulated object pose estimation is a crucial and fundamental computer vision problem with a wide range of applications in robotics, human-object interaction, and augmented reality Katz & Brock (2008); Mu et al (2021); Labbé et al (2021); Jiang et al (2022); Goyal et al (2022); Li et al (2020b). Different from 6D pose estimation for rigid objects Tremblay et al (2018); Xiang et al (2017); Sundermeyer et al (2018); Wang et al (2019a), articulated object pose estimation requires a hierarchical pose understanding on both the object-level and part-level Li et al (2020a).…”

Section: Introduction (mentioning)

confidence: 99%

Self-Supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance

Liu, Zhang, Hu et al. 2023

Preprint


Category-level articulated object pose estimation aims to estimate a hierarchy of articulation-aware object poses of an unseen articulated object from a known category. To reduce the heavy annotations needed for supervised learning methods, we present a novel self-supervised strategy that solves this problem without any human labels. Our key idea is to factorize canonical shapes and articulated object poses from input articulated shapes through part-level equivariant shape analysis. Specifically, we first introduce the concept of part-level SE(3) equivariance and devise a network to learn features of such property. Then, through a carefully designed fine-grained pose-shape disentanglement strategy, we expect that canonical spaces to support pose estimation could be induced automatically. Thus, we could further predict articulated object poses as per-part rigid transformations describing how parts transform from their canonical part spaces to the camera space. Extensive experiments demonstrate the effectiveness of our method on both complete and partial point clouds from synthetic and real articulated object datasets. The project page with code and more information can be found at: equi-articulated-pose.github.io.
