Over decades, many researchers developed complex in-lab systems with the overall goal to track multiple body parts of the user for a richer and more powerful 2D/3D interaction with a distant display. In this work, we introduce a novel smartphone-based tracking approach that eliminates the need for complex tracking systems. Relying on simultaneous usage of the front and rear smartphone cameras, our solution enables rich spatial interactions with distant displays by combining touch input with hand-gesture input, body and head motion, as well as eye-gaze input. In this paper, we firstly present a taxonomy for classifying distant display interactions, providing an overview of enabling technologies, input modalities, and interaction techniques, spanning from 2D to 3D interactions. Further, we provide more details about our implementation—using off-the-shelf smartphones. Finally, we validate our system in a user study by a variety of 2D and 3D multimodal interaction techniques, including input refinement.