On board a space station, visual gesture recognition is a key component of gesture-based human-robot interaction, enabling a robot to assist astronauts with simple, repetitive, and cooperative tasks. However, public gesture datasets are captured in everyday environments, so existing approaches trained on them perform poorly on astronaut gesture recognition. In this paper, we introduce a new astronaut gesture dataset (the DSSL-Astronaut gesture dataset) and a novel hierarchical attention single-shot detector network (HA-SSD) for astronaut gesture recognition. The dataset consists of real and augmented images. The real images, captured in a simulated space station, closely resemble images taken in the actual space station environment. To build the augmented set, we use a Mask R-CNN model to segment the astronaut foreground from the real data, capture background images of the simulated space station under different viewpoints and illuminations, and composite the two to synthesize augmented images. The HA-SSD model combines a lightweight MobileNet backbone with a hierarchical attention mechanism. MobileNet serves as the feature extractor; its depthwise separable convolutions trade off latency against accuracy. A hierarchical channel-wise attention module then exploits fine-grained semantic information to enrich multi-scale features and improve gesture recognition performance. Experiments demonstrate that our dataset is well suited to the space station setting and that our approach localizes and recognizes gestures effectively with strong generalization.
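As a minimal sketch of the compositing step described above, the astronaut foreground can be segmented with an off-the-shelf pretrained Mask R-CNN and alpha-blended onto a station background. The file names and the score/mask thresholds here are illustrative assumptions, not values from the paper:

```python
# Sketch of the augmentation step: segment the astronaut with a pretrained
# Mask R-CNN and paste the foreground onto a simulated-station background.
# File names and the 0.8 / 0.5 thresholds are illustrative assumptions.
import numpy as np
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

fg_img = Image.open("astronaut_real.jpg").convert("RGB")      # real capture
bg_img = Image.open("station_background.jpg").convert("RGB")  # station view
bg_img = bg_img.resize(fg_img.size)

with torch.no_grad():
    pred = model([to_tensor(fg_img)])[0]

# Keep the most confident 'person' detection (COCO label 1); torchvision
# returns detections sorted by score, so the first match is the best one.
person = (pred["labels"] == 1) & (pred["scores"] > 0.8)
mask = pred["masks"][person][0, 0] > 0.5                      # (H, W) bool

fg = np.asarray(fg_img, dtype=np.float32)
bg = np.asarray(bg_img, dtype=np.float32)
alpha = mask.numpy()[..., None].astype(np.float32)

augmented = alpha * fg + (1.0 - alpha) * bg                   # composite
Image.fromarray(augmented.astype(np.uint8)).save("augmented.jpg")
```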
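The depthwise separable convolution credited to MobileNet factors a standard convolution into a per-channel spatial filter followed by a 1x1 pointwise mixing step, which is the source of the latency/accuracy trade-off. A minimal PyTorch sketch, with illustrative layer sizes:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) 3x3 conv followed by a 1x1 pointwise conv,
    as used in MobileNet to cut FLOPs relative to a standard conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```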
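The abstract does not spell out the exact design of the hierarchical channel-wise attention module. One common realization of channel-wise attention is a squeeze-and-excitation style gate applied independently to each multi-scale SSD feature map, which is what this hypothetical sketch assumes (channel counts and map sizes are made up for illustration):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel gate: global average pool,
    two-layer bottleneck MLP, sigmoid reweighting of channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # (B, C) channel weights
        return x * w[:, :, None, None]          # reweight each channel map

# "Hierarchical" use: one gate per multi-scale SSD feature level,
# so every detection scale gets its own channel reweighting.
feats = [torch.randn(1, c, s, s) for c, s in [(256, 38), (512, 19), (256, 10)]]
gates = nn.ModuleList(ChannelAttention(f.shape[1]) for f in feats)
enriched = [g(f) for g, f in zip(gates, feats)]
print([t.shape for t in enriched])              # shapes are preserved
```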