In Embodied Question Answering (EmbodiedQA), an agent interacts with an environment to gather the information needed to answer user questions. Existing work has laid a solid foundation for this problem, but current performance, especially in navigation, suggests that EmbodiedQA may be too challenging for contemporary approaches. In this paper, we study the problem empirically and introduce 1) a simple yet effective baseline that achieves promising performance; and 2) an easier, more practical setting for EmbodiedQA in which an agent can adapt the trained model to a new environment before it actually answers users' questions. In this new setting, we randomly place a few objects in the new environment and upgrade the agent's policy with a distillation network so that it retains the generalization ability of the trained model. On the EmbodiedQA v1 benchmark, under the standard setting, our simple baseline achieves results competitive with the state of the art; in the new setting, we find that this small change yields a notable gain in navigation.

Index Terms: Embodied question answering, vision and language, visual question answering.

I. INTRODUCTION

A long-standing goal of artificial intelligence is to develop agents that can perceive and interact with their environment and communicate with humans in natural language. A representative research direction studies a goal-driven agent that can communicate with humans (language), perceive the environment (vision), and explore the space (taking actions). This paper focuses on one such problem, Embodied Question Answering (EmbodiedQA) [1], a sub-field derived from Visual Question Answering (VQA), in which users ask an agent questions and, to answer them, the agent must take actions to navigate the environment and collect evidence.
A key difference from related problems, such as visual navigation [2]-[4], is that the agent is given only first-person views and has no access to a global map of the environment or to its room/object layout. The example in Fig. 1 illustrates this challenging setting, in which the agent must answer questions about an object placed at a random location in the environment.