Designing multimodal systems that take full advantage of multiple error-prone recognition-based technologies, such as speech and gesture recognition, is difficult. To guarantee a robust and usable interaction, careful consideration must be given to the choice of interaction modalities made available, their allocation to tasks, and the range of modality combinations allowed. In this paper, we present a conceptual framework for evaluating the usability and robustness of different interaction modality combinations early in the process of designing a multimodal system. First, models of multimodal elementary commands are built using Finite State Machine (FSM) modelling. Second, the most usable representations of multimodal commands are identified by exercising the FSM-based models with real or simulated user inputs. The output of this step is a collection of FSMs that more closely represent user preferences and natural behaviour. Third, the disambiguating potential of sets of multimodal commands is evaluated by observing the models' responses to simulated recognition errors.
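To make the first step concrete, the following is a minimal sketch, not the implementation described in the paper, of how a single multimodal elementary command could be modelled as an FSM. The command, its input symbols (speech tokens and a pointing gesture), the state names, and the transition table are all illustrative assumptions; the sketch simply shows how alternative orderings of speech and gesture events can be accepted by one model while a sequence corrupted by a recognition error is rejected.

```python
# Sketch (illustrative, not the authors' implementation) of an FSM for one
# hypothetical multimodal command, "move <object> there", combining speech
# tokens with a pointing gesture that resolves the deictic "there".

class CommandFSM:
    """Finite state machine accepting representations of one multimodal command."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions   # {(state, input_symbol): next_state}
        self.start = start
        self.accepting = accepting

    def accepts(self, inputs):
        """Return True if the input event sequence reaches an accepting state."""
        state = self.start
        for symbol in inputs:
            key = (state, symbol)
            if key not in self.transitions:
                return False             # unexpected event: command not recognised
            state = self.transitions[key]
        return state in self.accepting


# Assumed transition table: the pointing gesture may follow or precede "there".
move_command = CommandFSM(
    transitions={
        ("idle",         "speech:move"):   "await_object",
        ("await_object", "speech:object"): "await_target",
        ("await_target", "speech:there"):  "await_point",
        ("await_target", "gesture:point"): "await_there",
        ("await_point",  "gesture:point"): "done",
        ("await_there",  "speech:there"):  "done",
    },
    start="idle",
    accepting={"done"},
)

# Two alternative orderings of the same command, both accepted by the model.
print(move_command.accepts(
    ["speech:move", "speech:object", "speech:there", "gesture:point"]))  # True
print(move_command.accepts(
    ["speech:move", "speech:object", "gesture:point", "speech:there"]))  # True
# A sequence corrupted by a simulated recognition error is rejected.
print(move_command.accepts(
    ["speech:move", "speech:object", "gesture:point", "speech:here"]))   # False
```

Feeding such models with logged or simulated user input, as in the second step, would amount to checking which accepted orderings users actually produce; the third step corresponds to perturbing input symbols and observing which command FSMs still discriminate between one another.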