Through training on large-scale image-text pairs, vision-language models (VLMs) learn to align visual information with natural-language semantics, which improves generalization to downstream tasks on unseen data. However, no comparably large neuromorphic vision dataset paired with natural language exists, so training such a model from scratch to achieve generalized understanding of event data is not feasible. This work therefore introduces a neuromorphic adapter network that combines CLIP-based semantic understanding with neuromorphic-inspired feature extraction, improving the robustness and efficiency of object classification in real-world scenarios. Specifically, by incorporating the temporal information of neuromorphic vision together with the multimodal strengths of CLIP, our approach excels in few-shot learning, generalizing to object classification tasks from only a few labeled samples per class. We evaluate our approach on three public datasets, N-Cars, N-Caltech, and N-ImageNet, and obtain encouraging few-shot classification accuracy compared to state-of-the-art models. Moreover, compared to zero-shot inference, our approach improves classification accuracy by +9.92%, +33.65%, and +42.63% under the 1-shot, 15-shot, and 20-shot settings, respectively. These results demonstrate the effectiveness of adapting pre-trained vision-language models to event data, enabling effective learning and inference even with limited annotated data.
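To make the adapter idea concrete, the following is a minimal PyTorch sketch of a residual bottleneck adapter placed on top of frozen CLIP features, in the spirit of the approach described above. It is not the paper's exact architecture: the module and parameter names (e.g., EventAdapter, residual_ratio), the feature dimension, and the assumption that event streams have already been encoded into CLIP image features elsewhere are all illustrative choices made here.

```python
# Sketch of a lightweight adapter over frozen CLIP features for few-shot
# event-based classification. All names and dimensions are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EventAdapter(nn.Module):
    """Bottleneck adapter applied to frozen CLIP image features (assumed layout)."""

    def __init__(self, feat_dim: int = 512, reduction: int = 4, residual_ratio: float = 0.2):
        super().__init__()
        self.residual_ratio = residual_ratio
        self.bottleneck = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // reduction, feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # Blend adapted features with the original CLIP features so the
        # pre-trained image-text alignment is largely preserved.
        adapted = self.bottleneck(image_feat)
        return self.residual_ratio * adapted + (1.0 - self.residual_ratio) * image_feat


def classify(image_feat: torch.Tensor, text_feat: torch.Tensor,
             adapter: EventAdapter, temperature: float = 0.01) -> torch.Tensor:
    """Class logits as temperature-scaled cosine similarity to class text embeddings."""
    img = F.normalize(adapter(image_feat), dim=-1)
    txt = F.normalize(text_feat, dim=-1)
    return img @ txt.t() / temperature


if __name__ == "__main__":
    # Stand-ins for features that would come from a frozen CLIP image/text encoder.
    num_classes, feat_dim = 10, 512
    image_feat = torch.randn(4, feat_dim)           # batch of event-derived image features
    text_feat = torch.randn(num_classes, feat_dim)  # one prompt embedding per class

    adapter = EventAdapter(feat_dim)
    logits = classify(image_feat, text_feat, adapter)
    print(logits.shape)  # torch.Size([4, 10])
```

In a few-shot setting of this kind, only the small adapter would be trained on the handful of labeled event samples while the CLIP encoders stay frozen, which is what makes learning feasible despite the limited annotated data.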