Object detection is an important perception task in autonomous driving and advanced driver assistance systems. The visible camera is widely used for perception, but its performance is limited by illumination and environmental variations. For robust vision-based perception, we propose a deep learning framework for effective sensor fusion of the visible camera with complementary sensors. A feature-level sensor fusion technique based on skip connections is proposed for fusing the visible camera with the millimeter-wave radar and with the thermal camera; the two resulting networks are called the RVNet and the TVNet, respectively. Each network has two input branches and one output branch. The input branches perform feature extraction for the individual sensors, and the extracted features are fused in the output perception branch using skip connections. The RVNet and the TVNet thus perform sensor-specific feature extraction, feature-level fusion, and object detection simultaneously within an end-to-end framework. The proposed networks are validated against baseline algorithms on public datasets. The results show that the proposed feature-level fusion outperforms the baseline early and late fusion frameworks.
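To make the two-input-branch, one-output-branch structure concrete, the following is a minimal PyTorch sketch of feature-level fusion with skip connections. It is not the published RVNet/TVNet architecture: the `TwoBranchFusionNet` name, all channel sizes, the two-stage depth, and the concatenation-based fusion are illustrative assumptions; only the overall topology (sensor-specific branches whose intermediate features feed a shared perception branch) follows the description above.

```python
# A minimal sketch (not the published RVNet/TVNet architecture) of
# feature-level fusion: two sensor-specific input branches whose
# intermediate features are merged into one output branch via
# concatenation-based skip connections. Names and sizes are assumptions.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """3x3 conv -> BN -> ReLU, halving spatial resolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TwoBranchFusionNet(nn.Module):
    """Hypothetical two-input-branch, one-output-branch fusion network."""

    def __init__(self, cam_channels=3, aux_channels=1, num_outputs=64):
        super().__init__()
        # Sensor-specific input branches (e.g. visible camera + radar/thermal).
        self.cam_branch = nn.ModuleList([conv_block(cam_channels, 32),
                                         conv_block(32, 64)])
        self.aux_branch = nn.ModuleList([conv_block(aux_channels, 32),
                                         conv_block(32, 64)])
        # Output perception branch: each stage consumes the previous fused
        # features concatenated with skip connections from both input branches.
        self.fuse1 = conv_block(32 + 32, 64)
        self.fuse2 = conv_block(64 + 64 + 64, num_outputs)

    def forward(self, cam, aux):
        # Stage 1 and 2 features from each sensor branch.
        c1, a1 = self.cam_branch[0](cam), self.aux_branch[0](aux)
        c2, a2 = self.cam_branch[1](c1), self.aux_branch[1](a1)
        # Feature-level fusion via concatenation skip connections.
        f1 = self.fuse1(torch.cat([c1, a1], dim=1))
        f2 = self.fuse2(torch.cat([f1, c2, a2], dim=1))
        return f2  # a detection head would attach here


# Usage: a camera frame and a single-channel auxiliary map of matching size.
net = TwoBranchFusionNet()
out = net(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Because fusion happens on intermediate feature maps rather than on raw inputs (early fusion) or final detections (late fusion), each branch can learn sensor-specific representations before they are combined, and the whole network remains trainable end to end.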