“…The input data consists of a sequence of T video frames x_{1:T} = (x_1, ..., x_T), force-torque measurements ft_{1:T} = (ft_1, ..., ft_T) with ft_t ∈ ℝ^6, and the gripper state g_{1:T} = (g_1, ..., g_T), where g_t ∈ {-0.5, 0.0, 0.5} refers to {open, partially closed, closed}, respectively. Two additional annotations are provided: the current human action h_{1:T} = (h_1, ..., h_T), where h_t ∈ {idle, approach, interact, retract, post-idle, not released, dropped}, which is available only during training; and the current robot action r_{1:T} = (r_1, ..., r_T), where r_t ∈ {approach, interact, retract}, which is available at both training and inference time.…”
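The per-timestep inputs and label spaces above can be sketched as a minimal Python container. This is a hedged illustration only: the class, field names, and validation are assumptions for clarity, not the authors' actual data format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Label spaces as described in the text (names are the paper's categories).
HUMAN_ACTIONS = ("idle", "approach", "interact", "retract",
                 "post-idle", "not released", "dropped")   # h_t, training-only
ROBOT_ACTIONS = ("approach", "interact", "retract")        # r_t, train + inference
GRIPPER_STATES = (-0.5, 0.0, 0.5)                          # open, partially closed, closed


@dataclass
class Timestep:
    """One timestep t of the input sequence (hypothetical container)."""
    frame: Sequence                        # video frame x_t (e.g. an HxWxC array)
    force_torque: Sequence[float]          # ft_t: 6 values (fx, fy, fz, tx, ty, tz)
    gripper: float                         # g_t in {-0.5, 0.0, 0.5}
    robot_action: str                      # r_t: available at train and test time
    human_action: Optional[str] = None     # h_t: label only present during training

    def __post_init__(self):
        # Basic consistency checks matching the description in the text.
        assert len(self.force_torque) == 6
        assert self.gripper in GRIPPER_STATES
        assert self.robot_action in ROBOT_ACTIONS
        assert self.human_action is None or self.human_action in HUMAN_ACTIONS
```

At inference time `human_action` would simply stay `None`, reflecting that h_t is a training-only annotation while r_t remains available.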