We introduce a novel hybrid Transformer-CNN architecture for robotic grasp detection, designed to improve grasping accuracy on unknown objects. The proposed architecture features two key designs. First, we develop a hierarchical Transformer encoder that incorporates external attention to effectively capture correlation features across the data. Second, the decoder is constructed with cross-layer connections to efficiently fuse multi-scale features; channel attention is introduced in the decoder to model inter-channel correlations and adaptively recalibrate channel-wise feature responses, thereby increasing the weight of the most informative channels. Our method is evaluated on the public Cornell and Jacquard datasets, achieving image-wise detection accuracies of 98.3% and 95.8%, respectively, and object-wise detection accuracies of 96.9% and 92.4% on the same datasets. A physical experiment is also performed with an Elite 6-DoF robot, yielding a grasp success rate of 93.3% and demonstrating that the proposed method can grasp unknown objects in real-world scenarios. These results show that our proposed method outperforms other state-of-the-art approaches.
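To make the two attention mechanisms named above concrete, the following is a minimal NumPy sketch of (a) external attention, which attends over two small learnable memory units shared across the dataset rather than over the input itself, and (b) squeeze-and-excitation-style channel attention, which recalibrates channel responses from globally pooled statistics. All tensor shapes, memory sizes, and the reduction ratio are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def external_attention(x, M_k, M_v):
    """External attention over learnable memories (illustrative sketch).
    x: (N, d) token features; M_k, M_v: (S, d) shared memory units."""
    attn = softmax(x @ M_k.T, axis=-1)                      # (N, S) similarity to memory
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-9)  # double normalization
    return attn @ M_v                                       # (N, d) aggregated output

def channel_attention(feat, W1, W2):
    """Squeeze-and-excitation-style channel recalibration (illustrative).
    feat: (C, H, W); W1: (C//r, C) and W2: (C, C//r) are learned weights."""
    z = feat.mean(axis=(1, 2))                              # squeeze: (C,) global context
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))  # excitation: (C,) gates
    return feat * s[:, None, None]                          # reweight each channel

# Toy shapes only: 16 tokens of width 32, 8 memory units, 64 channels, ratio r=4.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
M_k = rng.standard_normal((8, 32))
M_v = rng.standard_normal((8, 32))
out_tokens = external_attention(x, M_k, M_v)    # shape (16, 32)

feat = rng.standard_normal((64, 7, 7))
W1 = rng.standard_normal((16, 64))
W2 = rng.standard_normal((64, 16))
out_feat = channel_attention(feat, W1, W2)      # shape (64, 7, 7)
```

Both operations are linear in the number of tokens, which is one reason external attention is attractive for dense prediction tasks such as grasp detection; in the paper's architecture these would be trained end to end rather than applied with random weights as here.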