Marine mammals and their ecosystems face significant threats from sources such as military active sonar and marine transportation. To mitigate this harm, early detection and classification of marine mammals are essential. Recent efforts have applied spectrogram analysis and machine learning techniques, but their efficiency remains a challenge. We therefore propose a novel knowledge distillation framework, named XCFSMN, for this problem. We construct a teacher model that fuses the features extracted by an X-vector extractor, a DenseNet, and a cross-covariance attended compact Feed-Forward Sequential Memory Network (cFSMN). The teacher model transfers its knowledge to a simpler cFSMN student model through a temperature-cooling strategy for efficient learning. Compared to multiple convolutional neural network backbones and transformers, the proposed framework achieves state-of-the-art efficiency and performance. The improved model is approximately 20 times smaller, and its inference time can be 10 times shorter, without affecting the model's accuracy.
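
To make the distillation step concrete, the following is a minimal sketch of a temperature-scaled distillation loss combined with a cooling schedule. The abstract does not specify the loss or the schedule, so everything here is an assumption for illustration: the KL-divergence form with the common T^2 scaling, the linear decay in `cooled_temperature`, and the values `t_start=4.0` and `t_end=1.0` are hypothetical and not taken from XCFSMN.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax; a higher temperature softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by T^2 (a common convention)."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target) distribution
    log_q = log_softmax(student_logits, temperature)  # student distribution
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def cooled_temperature(epoch, num_epochs, t_start=4.0, t_end=1.0):
    """Hypothetical linear cooling schedule: t_start at epoch 0, t_end at the last epoch."""
    if num_epochs <= 1:
        return t_end
    frac = epoch / (num_epochs - 1)
    return t_start + frac * (t_end - t_start)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # illustrative logits, not real model outputs
    student = [1.0, 1.0, 0.0]
    for epoch in range(5):
        t = cooled_temperature(epoch, 5)
        print(epoch, round(t, 2), distillation_loss(student, teacher, t))
```

In such a scheme the student first matches a softened teacher distribution, which exposes the relative similarity between classes, and is then gradually pushed toward the teacher's sharper predictions as the temperature decreases.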