“…With our proposed emulated Python console prompting, we differ from these existing works by (i) formatting and interpreting all interaction with the LLM as Python code, in contrast to [6], [8], (ii) closing the interaction loop by enabling the LLM to reason about each perception and action outcome, in contrast to [7], [32], [34], [31], [6], (iii) allowing the LLM to decide itself when and which perception primitives to invoke, instead of providing a predefined list of observations (usually a list of objects in the scene) as part of the prompt as in [31], [8], [32], [7], [35], and (iv) simplifying the task for the LLM by allowing it to generate one statement at a time, in contrast to [7], [32], [33].…”