The generation of bimanual object manipulation sequences given a semantic action label has broad applications in collaborative robots or augmented reality. This relatively new problem differs from existing works that generate whole-body motions without any object interaction as it now requires the model to additionally learn the spatio-temporal relationship that exists between the human joints and object motion given said label. To tackle this task, we leverage the varying degree each muscle or joint is involved during object manipulation. For instance, the wrists act as the prime movers for the objects while the finger joints are angled to provide a firm grip. The remaining body joints are the least involved in that they are positioned as naturally and comfortably as possible. We thus design an architecture that comprises 3 main components: (i) a graph recurrent network that generates the wrist and object motion, (ii) an attention-based recurrent network that estimates the required finger joint angles given the graph configuration, and (iii) a recurrent network that reconstructs the body pose given the locations of the wrist. We evaluate our approach on the KIT Motion Capture and KIT RGBD Bimanual Manipulation datasets and show improvements over a simplified approach that treats the entire body as a single entity, and existing whole-body-only methods.