Multi-modal three-dimensional (3D) object detection is a crucial technology for the safe and reliable operation of environment perception systems in autonomous driving. In this study, we propose a context clustering-based radar and camera fusion method for 3D object detection (ConCs-Fusion), which combines radar and camera sensors at the intermediate fusion level. We extract features from the heterogeneous sensors and feed them into the fusion module as feature point sets. Within the fusion module, context cluster blocks learn multi-scale features of the radar point clouds and images, after which the feature maps are upsampled and fused. A multi-layer perceptron then nonlinearly transforms the fused features, reducing their dimensionality to improve model inference speed. Within each context cluster block, feature points that belong to the same object but originate from different sensors are aggregated into one cluster according to their similarity. All feature points in a cluster are fused into a single radar–camera feature point, which is then adaptively dispatched back to the feature points originally extracted from each individual sensor. Unlike previous methods that treat radar merely as an auxiliary sensor to the camera, or vice versa, ConCs-Fusion achieves bidirectional cross-modal fusion between radar and camera. Finally, extensive experiments on the nuScenes dataset demonstrate that ConCs-Fusion outperforms comparable methods in 3D object detection.
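
To make the cluster-aggregate-dispatch step concrete, the following is a minimal sketch of a context-cluster style fusion block, assuming a PyTorch implementation. The class name `ContextClusterFusion`, the learnable cluster centers, the tensor shapes, and the hard cosine-similarity assignment are illustrative assumptions for exposition, not the authors' released code.

```python
# Minimal sketch of context-cluster fusion over a heterogeneous point set.
# All names, shapes, and the clustering heuristic are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextClusterFusion(nn.Module):
    """Fuse radar and camera feature points by similarity-based clustering.

    Both modalities are flattened into one point set; each point is assigned
    to its most similar cluster center, every cluster is aggregated into one
    fused radar-camera point, and that fused point is dispatched back to the
    cluster's original points, weighted by similarity.
    """

    def __init__(self, dim: int, num_centers: int = 16):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim))  # learnable centers
        self.value = nn.Linear(dim, dim)     # value projection before aggregation
        self.dispatch = nn.Linear(dim, dim)  # projection of the fused point on dispatch

    def forward(self, radar_pts: torch.Tensor, cam_pts: torch.Tensor) -> torch.Tensor:
        # radar_pts: (B, Nr, D), cam_pts: (B, Nc, D) -> one heterogeneous set
        pts = torch.cat([radar_pts, cam_pts], dim=1)               # (B, N, D)
        sim = F.cosine_similarity(                                 # (B, N, K)
            pts.unsqueeze(2), self.centers.view(1, 1, -1, pts.size(-1)), dim=-1
        )
        assign = sim.argmax(dim=-1)                                # hard assignment (B, N)
        mask = F.one_hot(assign, self.centers.size(0)).float()    # (B, N, K)
        w = torch.sigmoid(sim) * mask                              # per-cluster weights

        v = self.value(pts)                                        # (B, N, D)
        # Aggregate each cluster into a single fused radar-camera point.
        cluster_sum = torch.einsum("bnk,bnd->bkd", w, v)           # (B, K, D)
        cluster_feat = cluster_sum / (w.sum(dim=1).unsqueeze(-1) + 1e-6)

        # Dispatch: redistribute each fused point back to its member points.
        fused_back = torch.einsum("bnk,bkd->bnd", w, self.dispatch(cluster_feat))
        return pts + fused_back                                    # residual update


# Usage: 128 radar points and 1024 camera points, 64-dim features.
fusion = ContextClusterFusion(dim=64)
out = fusion(torch.randn(2, 128, 64), torch.randn(2, 1024, 64))    # (2, 1152, 64)
```

Because every point in a cluster, regardless of its source sensor, both contributes to and receives from the fused point, the update flows in both directions across modalities, which is the sense in which the fusion is bidirectional.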