Specific emitter identification (SEI) is the task of distinguishing similar emitters, especially those of the same type with the same transmission parameters, and is one of the most critical tasks in electronic warfare. However, SEI remains challenging when the extracted features only weakly represent the physical characteristics of the emitter, because feature representation largely determines the recognition results. This article therefore aims at robust feature representation for SEI. Efficient multimodal strategies have great potential in applications where multimodal data are available and can further improve SEI performance. We introduce a multimodal emitter identification method that applies two modalities, time-series radar signals and feature vector data, to an enhanced transformer, which employs a conformer block to embed the raw data and integrates an efficient multimodal feature representation module. Moreover, we employ self-knowledge distillation to mitigate overconfident predictions and reduce intra-class variations. In addition, we propose the CV-CutMixOut method to augment the time-domain signal. Our study reveals that multimodal data provide sufficient information for SEI. Extensive experiments on real radar datasets indicate that the proposed method achieves more accurate identification results and higher feature discriminability.