When aligning the hand to grasp an object, the CNS combines multiple sensory inputs encoded in different reference frames. Previous studies suggest that when a direct comparison of target and hand is possible within a single sensory modality, the CNS avoids performing unnecessary coordinate transformations that add noise. But when target and hand do not share a common sensory modality (e.g., aligning the unseen hand to a visual target), at least one coordinate transformation is required. Similarly, body movements may occur between target acquisition and manual response, requiring that egocentric target information be updated or transformed into external reference frames to compensate. Here, we asked subjects to align the hand to an external target that was presented either visually or kinesthetically, while feedback about the hand was visual, kinesthetic, or both. We used a novel technique to impose conflict between external visual and gravito-kinesthetic reference frames by having subjects tilt the head during an instructed memory delay. By comparing experimental results with analytical models based on principles of maximum likelihood, we showed that more transformations than the strict minimum may be performed, but only if the task precludes a unimodal comparison of egocentric target and hand information. Thus, for cross-modal tasks, or when head movements are involved, the CNS creates and uses both kinesthetic and visual representations. We conclude that the necessity of performing at least one coordinate transformation activates multiple, concurrent internal representations, the functionality of which depends on the alignment of the head with respect to gravity.
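The maximum-likelihood reasoning invoked above can be illustrated with a minimal sketch. The code below is not the paper's model: the function name combine_mle and all variance values (var_visual, var_kinesthetic, var_transform) are hypothetical, chosen only to show how inverse-variance weighting combines concurrent representations and how each coordinate transformation inflates the variance of the transformed signal.

```python
import numpy as np

# Minimal sketch of maximum-likelihood (inverse-variance) cue combination.
# All variances below are hypothetical illustrations: each coordinate
# transformation is assumed to add independent noise (var_transform), so the
# combined estimate is more precise when a unimodal (within-modality)
# comparison of target and hand is possible and no transformation is needed.

def combine_mle(estimates, variances):
    """Combine independent Gaussian estimates by inverse-variance weighting."""
    w = 1.0 / np.asarray(variances, dtype=float)
    mean = np.sum(w * np.asarray(estimates, dtype=float)) / np.sum(w)
    var = 1.0 / np.sum(w)  # variance of the combined estimate
    return mean, var

# Hypothetical variances of the target-hand error signal (arbitrary units)
var_visual      = 4.0   # visual target compared with visual hand feedback
var_kinesthetic = 9.0   # kinesthetic target compared with kinesthetic feedback
var_transform   = 6.0   # noise added by one coordinate transformation

# Case 1: unimodal comparisons are possible -> no transformation required.
_, var_unimodal = combine_mle([0.0, 0.0], [var_visual, var_kinesthetic])

# Case 2: cross-modal task (e.g., unseen hand to a visual target) -> at least
# one representation must be transformed, inflating its variance; both
# transformed comparisons can still be carried out concurrently and combined.
var_vis_in_kin = var_visual + var_transform        # vision mapped into a kinesthetic frame
var_kin_in_vis = var_kinesthetic + var_transform   # kinesthesia mapped into a visual frame
_, var_crossmodal = combine_mle([0.0, 0.0], [var_vis_in_kin, var_kin_in_vis])

print(f"combined variance, unimodal comparisons available: {var_unimodal:.2f}")
print(f"combined variance, cross-modal (transformed)     : {var_crossmodal:.2f}")
```

Under these assumed numbers the unimodal case yields a lower combined variance (about 2.77) than the cross-modal case (6.0), which captures the intuition for avoiding unnecessary transformations while still combining multiple concurrent representations when a transformation is unavoidable.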