As the representatives of brain-inspired models at the neuronal level, spiking neural networks (SNNs) have shown great promise in processing spatiotemporal information with intrinsic temporal dynamics. SNNs are expected to further improve their robustness and computing efficiency by introducing top-down attention at the architectural level, which is crucial for the human brain to support advanced intelligence. However, this attempt encounters difficulties in optimizing the attention in SNNs largely due to the lack of annotations. Here, we develop a hybrid network model with a top-down attention mechanism (HTDA) by incorporating an artificial neural network (ANN) to generate attention maps based on the features extracted by a feedforward SNN. The attention map is then used to modulate the encoding layer of the SNN so that it focuses on the most informative sensory input. To facilitate direct learning of attention maps and avoid labor-intensive annotations, we propose a general principle and a corresponding weakly-supervised objective, which promotes the HTDA model to utilize an integral and small subset of the input to give accurate predictions. On this basis, the ANN and the SNN can be jointly optimized by surrogate gradient descent in an end-to-end manner. We comprehensively evaluated the HTDA model on object recognition tasks, which demonstrates strong robustness to adversarial noise, high computing efficiency, and good interpretability. On the widely-adopted CIFAR-10, CIFAR-100, and MNIST benchmarks, the HTDA model reduces firing rates by up to 50% and improves adversarial robustness by up to 10% with comparable or better accuracy compared with the state-of-the-art SNNs. The HTDA model is also verified on dynamic neuromorphic datasets and achieves consistent improvements. This study provides a new way to boost the performance of SNNs by employing a hybrid top-down attention mechanism.