People use how-to videos, both live and pre-recorded, to learn new physical skills [37], ranging from repairing a broken keyboard to learning digital fabrication [42]. In most how-to videos for physical skills, the instructor demonstrates step by step how to complete the task [14]. Because these steps may involve activities at different locations and at different levels of detail, a single fixed camera often cannot record every step with the desired clarity [39]. This necessitates frequent changes to camera parameters, including viewpoints, angles, and zoom levels.
Professional video productions, such as cooking and home improvement shows, employ several dedicated camera operators who actively reposition cameras and adjust their parameters in response to the instructor's actions. However, such resources are not available to most instructors, who instead rely on one or more preconfigured fixed cameras. Although fixed camera setups can be reconfigured during recording, instructors need to stop what they are demonstrating (e.g., chopping vegetables) to manipulate the camera. This disrupts the demonstration and increases the instructor's workload. It also requires more post-processing to combine clips filmed with different camera setups.
Both filmmakers and researchers have explored the idea of camera-manipulating robots as an alternative to human operators [1,27]. Recent camera robots (predominantly drones) can autonomously track moving subjects [6,24,41]. However, it remains a challenge for the user being filmed to control a camera robot's behavior while performing other activities, such as demonstrating a physical process. Conventional interfaces for robot control employ joysticks [52], gestures [49], and speech [16], all of which require dedicated input actions that disrupt instruction delivery. If not edited out, such disruptions might split the audience's attention and hinder learning [9], but editing them out adds to the instructor's post-processing effort.
Recent user interface research has explored triggering on-screen visual effects through gestures and speech that are already part of a presenter's delivery [22,32,48]. Our approach in this work is to (1) identify the kinds of camera shots that how-to videos use and (2) direct camera operations in a non-disruptive manner by relying on the communicative signals that instructors already use to address their audience during demonstrations. For instance, an instructor may point to a part of an object to emphasize it, use speech to guide the audience's attention, or wave to introduce themselves.