Figure 1: Overall architecture of AQuA, our question-answering pipeline, which generates useful responses to questions asked in software tutorial videos. Each question is accompanied by a visual anchor, a specific visual element of interest in the video. The Visual Recognition Module generates a textual description of the visual anchor. The Retrieval Module combines this description with the question and retrieves articles relevant to the resulting query. Resources in yellow boxes are software-specific materials (in this case, for Fusion 360). The retrieved articles, the question text, the visual anchor description, the title of the tutorial video, and relevant transcript sentences are then fed into GPT-4 through crafted prompts.
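
The data flow in the figure can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Article`, `build_query`, `retrieve`, `build_prompt`) is hypothetical, the visual anchor description is assumed to be already produced by the Visual Recognition Module, the token-overlap ranking is only a toy stand-in for the Retrieval Module, and the prompt layout is an assumed format rather than the crafted prompts used with GPT-4. The sketch stops at building the prompt string and makes no model call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Article:
    """A software-specific help article (hypothetical structure)."""
    title: str
    text: str


def build_query(question: str, anchor_description: str) -> str:
    """Combine the question with the visual anchor description."""
    return f"{question} {anchor_description}".strip()


def retrieve(query: str, corpus: List[Article], k: int = 2) -> List[Article]:
    """Rank articles by token overlap with the query.

    Assumption: a toy stand-in for the Retrieval Module, not the paper's retriever.
    """
    query_tokens = set(query.lower().split())

    def overlap(article: Article) -> int:
        doc_tokens = set(f"{article.title} {article.text}".lower().split())
        return len(query_tokens & doc_tokens)

    # sorted() is stable, so ties keep their original corpus order.
    return sorted(corpus, key=overlap, reverse=True)[:k]


def build_prompt(
    question: str,
    anchor_description: str,
    video_title: str,
    transcript_sentences: List[str],
    articles: List[Article],
) -> str:
    """Assemble all five inputs named in the caption into one prompt string.

    Assumption: the section order and labels are illustrative only.
    """
    sections = [
        f"Tutorial title: {video_title}",
        "Relevant transcript sentences:\n"
        + "\n".join(f"- {s}" for s in transcript_sentences),
        f"Visual anchor description: {anchor_description}",
        "Retrieved articles:\n"
        + "\n".join(f"- {a.title}: {a.text}" for a in articles),
        f"Question: {question}",
    ]
    return "\n\n".join(sections)
```

In use, the string returned by `build_prompt` would be sent to the language model; keeping query construction, retrieval, and prompt assembly as separate functions mirrors the separate modules in the figure and lets each stage be swapped independently.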