Normally, we do not act within a single effector system only, but rather coordinate actions across several output modules (cross-modal action). Such cross-modal action demands can vary substantially in complexity, both in the number of task-relevant response combinations and in the number of to-be-retrieved stimulus–response (S–R) mapping rules. In the present study, we examine the impact of these two types of cross-modal action complexity on dual-response costs (i.e., performance differences between single- and dual-action demands). In Experiment 1, we combined a manual and an oculomotor task, each involving four response alternatives. Crucially, an unconstrained condition involved all 16 possible combinations of response alternatives, whereas a constrained condition involved only a subset of these combinations. The results revealed that preparing for a larger number of response combinations yielded a significant but moderate increase in dual-response costs. In Experiment 2, a single lateralized auditory stimulus (e.g., left) triggered incompatible response compounds (e.g., left saccade and right key press or vice versa). One condition involved only one set of task-relevant S–R rules, whereas another condition involved two sets of task-relevant rules (coded by stimulus type: noise/tone); the number of task-relevant response combinations was the same in both conditions. Here, an increase in the number of to-be-retrieved S–R rules was associated with a substantial increase in dual-response costs, which were also modulated on a trial-by-trial basis when switching between rules. Taken together, the results shed further light on the dependency of cross-modal action control on both action- and rule-related memory retrieval processes.