Combining data from multiple sensors to improve the overall robustness and reliability of a classification system has become crucial in many applications, ranging from military surveillance and decision support to autonomous driving, robotics, and medical imaging. This so-called sensor fusion is especially relevant for fine-grained target classification, in which very specific sub-categories (e.g. ship types) need to be distinguished, a task that can be challenging with data from a single modality. Typical modalities are electro-optical (EO) image sensors, which provide rich visual detail of an object of interest, and radar, which yields additional spatial information. Several fusion techniques exist, distinguished by how data from these sensors are combined. For example, late fusion merges the class probabilities output by separate processing pipelines, one dedicated to each sensor. In particular, deep learning (DL) has been widely leveraged for EO image analysis, but it typically requires large amounts of data to capture the nuances of a fine-grained classification task. Recent advances in DL foundation models have shown great potential for dealing with in-domain data scarcity, especially in combination with few-shot learning. This paper presents a framework to effectively combine EO and radar sensor data, and shows that this method outperforms stand-alone single-sensor methods for fine-grained target classification. We adopt a strong few-shot image classification baseline based on foundation models, which robustly handles the lack of in-domain data and exploits rich visual features. In addition, we investigate a weighted and a Bayesian fusion approach to combine the target class probabilities output by the image classification model with radar kinematic features. Experiments with data acquired in a measurement campaign at the port of Rotterdam show that our fusion method improves on the classification performance of the individual modalities.
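
For intuition, the sketch below illustrates two generic late-fusion rules for combining per-class probabilities from two modalities: a convex weighted average and a naive-Bayes-style product rule. The function names, the weight `alpha`, and the example probabilities are illustrative assumptions only, not the exact formulation or parameter values evaluated in the paper.

```python
# Minimal sketch of late fusion of class probabilities from two modalities,
# e.g. an EO image classifier and a radar-based classifier.
import numpy as np

def weighted_late_fusion(p_eo: np.ndarray, p_radar: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of two class-probability vectors, renormalised."""
    fused = alpha * p_eo + (1.0 - alpha) * p_radar
    return fused / fused.sum()

def bayesian_late_fusion(p_eo: np.ndarray, p_radar: np.ndarray) -> np.ndarray:
    """Naive-Bayes-style product rule: multiply per-class probabilities and renormalise."""
    fused = p_eo * p_radar
    return fused / fused.sum()

# Example with three hypothetical ship classes, where the two modalities disagree.
p_eo = np.array([0.6, 0.3, 0.1])
p_radar = np.array([0.2, 0.5, 0.3])
print(weighted_late_fusion(p_eo, p_radar, alpha=0.7))  # EO weighted more heavily
print(bayesian_late_fusion(p_eo, p_radar))
```

The weighted rule lets one modality dominate when it is known to be more reliable, whereas the product rule rewards agreement between modalities; both are standard late-fusion baselines rather than the paper's specific method.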