Several automatic approaches facilitated the video editing process by applying edits based on predetermined markers [25], placing transitions and cuts in interview videos [9], adding visuals to audio travel podcasts [103], selecting appropriate clips for dialogue-driven scenes [60], adding lyric text to music videos [73], and placing cuts by matching the user's voice-over annotations with relevant segments of the video [97]. Other systems bootstrapped the editing process by generating videos from documents [22, 23], web pages [24, 52], text-based instructions [108], recipe texts [98], and articles [61]; by synthesizing talking-head videos of puppets [32]; or by using deep learning methods to automatically generate speech animations [95]. However, these automated approaches restrict the editor's control over the editing process by supporting only predetermined input formats for interaction (e.g., markers, annotations), which in turn inhibits expressiveness.