In this paper, we describe a Wizard of Oz (WOz) user study of an Augmented Reality (AR) interface that uses multimodal input (MMI) combining natural hand interaction with speech commands. Our goal is to use the WOz study to guide the creation of a multimodal AR interface that is as natural as possible for the user. The study involved three virtual object arranging tasks performed with two display types (a head-mounted display and a desktop monitor) to examine how users issued multimodal commands and how the different AR display conditions affected those commands. The results provide valuable insights into how people naturally interact in a multimodal AR scene assembly task. For example, we identified the optimal time frame for fusing speech and gesture input into a single multimodal command. We also found that display type did not produce a significant difference in the types of commands used. Based on these results, we present design recommendations for multimodal interaction in AR environments.
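To make the fusion step concrete, the sketch below shows one simple way a time-window rule of this kind could be implemented; it is illustrative only, not the system used in the study. The `FUSION_WINDOW_S` constant, the `SpeechEvent` and `GestureEvent` classes, and the `fuse` function are all hypothetical names, and the 1.0 s window is a placeholder rather than the empirically derived value reported in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical fusion window in seconds. The paper reports an empirically
# determined value; 1.0 here is a placeholder, not the study's result.
FUSION_WINDOW_S = 1.0

@dataclass
class SpeechEvent:
    command: str      # recognized verb, e.g. "move" or "rotate"
    timestamp: float  # seconds since session start

@dataclass
class GestureEvent:
    target: str       # object indicated by the hand, e.g. "block_3"
    timestamp: float

def fuse(speech: SpeechEvent, gesture: GestureEvent) -> Optional[Tuple[str, str]]:
    """Return a fused (command, target) pair if the two inputs arrive
    within the fusion window; otherwise return None so each input can
    be handled as a unimodal command."""
    if abs(speech.timestamp - gesture.timestamp) <= FUSION_WINDOW_S:
        return (speech.command, gesture.target)
    return None

# Example: "move" spoken 0.4 s after pointing at block_3 fuses into one command.
print(fuse(SpeechEvent("move", 10.4), GestureEvent("block_3", 10.0)))
```

A window-based rule like this is a common baseline for late fusion of multimodal events; the key design question the study speaks to is how wide that window should be for speech and gesture in AR.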