This paper focuses on semantic task planning, i.e., predicting a sequence of actions toward accomplishing a specific task under a certain scene, which is a new problem in computer vision research. The primary challenges are how to model task-specific knowledge and how to integrate this knowledge into the learning procedure. In this work, we propose training a recurrent long short-term memory (LSTM) network to address this problem, i.e., taking a scene image (including pre-located objects) and the specified task as input and recurrently predicting action sequences. However, training such a network generally requires a large number of annotated samples to cover the semantic space (e.g., diverse action decompositions and orderings). To overcome this issue, we introduce a knowledge And-Or graph (AOG) for task description, which hierarchically represents a task as atomic actions. With this AOG representation, we can produce many valid samples (i.e., action sequences that accord with common sense) by training another auxiliary LSTM network with a small set of annotated samples. Furthermore, these generated samples (i.e., task-oriented action sequences) effectively facilitate training of the model for semantic task planning. In our experiments, we create a new dataset that contains diverse daily tasks and extensively evaluate the effectiveness of our approach.

Figure 1: The task "How to make tea?" in a scene containing a cup (empty), a tea-box (closed), a pot (hot water), and a water dispenser (not empty). Two valid atomic-action sequences accomplish the task: both begin by opening the tea-box and pouring tea into the cup ({move to, tea-box}, {grasp, tea-box}, {open, tea-box}, {move to, cup}, {grasp, cup}, {hold, cup}, {pour into, cup}, {place back, tea-box}); Seq 1 then adds hot water from the pot ({hold, cup}, {move to, pot}, {grasp, pot}, {pour into, cup}, {place back, pot}), whereas Seq 2 fills the cup from the water dispenser ({hold, cup}, {move to, water-dis}, {pour into, cup}).
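To make the AOG-based sample generation concrete, the sketch below shows one plausible way to encode the tea-making task from Figure 1 as an And-Or graph and to draw valid atomic-action sequences from it. This is a minimal illustration under assumed structure: the `AndNode`/`OrNode`/`Leaf` classes and the particular decomposition are hypothetical and are not the paper's implementation.

```python
import random

# Sketch of a knowledge And-Or graph (AOG), assuming a simple recursive node
# structure. And-nodes decompose a (sub-)task into ordered sub-tasks that must
# all be executed; Or-nodes offer alternative decompositions, one of which is
# chosen; leaf nodes are atomic actions of the form {action, object}.

class Leaf:
    def __init__(self, action, obj):
        self.action, self.obj = action, obj
    def sample(self):
        return [(self.action, self.obj)]

class AndNode:
    def __init__(self, children):
        self.children = children
    def sample(self):
        # Concatenate the sequences of all children, preserving order.
        return [a for c in self.children for a in c.sample()]

class OrNode:
    def __init__(self, children):
        self.children = children
    def sample(self):
        # Pick one alternative at random; every choice yields a valid sequence.
        return random.choice(self.children).sample()

# Hypothetical AOG for "make tea", following the Figure 1 example.
get_tea = AndNode([Leaf("move to", "tea-box"), Leaf("grasp", "tea-box"),
                   Leaf("open", "tea-box"), Leaf("move to", "cup"),
                   Leaf("grasp", "cup"), Leaf("hold", "cup"),
                   Leaf("pour into", "cup"), Leaf("place back", "tea-box")])
get_hot_water = OrNode([
    AndNode([Leaf("hold", "cup"), Leaf("move to", "pot"), Leaf("grasp", "pot"),
             Leaf("pour into", "cup"), Leaf("place back", "pot")]),   # from the pot
    AndNode([Leaf("hold", "cup"), Leaf("move to", "water-dis"),
             Leaf("pour into", "cup")]),                              # from the dispenser
])
make_tea = AndNode([get_tea, get_hot_water])

# Each call yields one valid action sequence; such generated samples can
# augment a small annotated set when training the sequence-prediction model.
print(make_tea.sample())
```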
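Similarly, the following sketch outlines the kind of recurrent predictor described above: a network that fuses scene and task features into an LSTM state and emits one atomic action per step. The feature dimensions, fusion scheme, token ids, and greedy decoding are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActionPlanner(nn.Module):
    """Toy recurrent action predictor (illustrative only)."""
    def __init__(self, scene_dim, task_dim, num_actions, hidden=256):
        super().__init__()
        self.init_state = nn.Linear(scene_dim + task_dim, hidden)  # fuse scene + task features
        self.embed = nn.Embedding(num_actions, hidden)             # embed the previous action token
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.classify = nn.Linear(hidden, num_actions)             # score the next atomic action

    def forward(self, scene_feat, task_feat, max_steps=20, stop_id=1):
        # Initialize the LSTM state from the fused scene/task representation.
        h = torch.tanh(self.init_state(torch.cat([scene_feat, task_feat], dim=-1)))
        c = torch.zeros_like(h)
        prev = torch.zeros(scene_feat.size(0), dtype=torch.long)   # start token (assumed id 0)
        sequence = []
        for _ in range(max_steps):
            h, c = self.lstm(self.embed(prev), (h, c))
            prev = self.classify(h).argmax(dim=-1)                 # greedy decoding, one action per step
            sequence.append(prev)
            if (prev == stop_id).all():                            # stop action (assumed id 1)
                break
        return torch.stack(sequence, dim=1)                        # (batch, steps) action ids

# Usage with random features in place of real scene/task encodings:
planner = ActionPlanner(scene_dim=2048, task_dim=300, num_actions=50)
plan = planner(torch.randn(1, 2048), torch.randn(1, 300))
```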