Recognizing targets in side-scan sonar (SSS) data with deep learning techniques has proven particularly challenging. The primary difficulty stems from the cost and time of underwater acoustic data acquisition, which requires systematic surveys to collect sufficient training samples for accurate deep learning models. Moreover, when only a small number of samples is available, designing effective target recognition models becomes difficult. These obstacles have hindered the development of accurate deep learning-based SSS target recognition methods. Exploiting multi-modal datasets to improve sonar image recognition through knowledge transfer in deep networks is a promising remedy; however, because images of different modalities have distinct statistical properties, transferring between modalities can substantially increase the complexity of network training. This issue remains unresolved and directly limits transfer recognition performance. To improve the classification accuracy of underwater sonar images under limited modality types and scarce data samples, this study proposes a crossed point-to-point second-order self-attention (PPCSSA) method based on double-mode sample transfer recognition. In the PPCSSA method, first-order importance features are obtained by extracting key horizontal and longitudinal point-to-point features. A self-attention strategy then removes redundant information from these features, yielding the second-order salient features of SSS images. This strategy provides an effective small-sample learning scheme for transfer learning with few modality types. Classification experiments show that the proposed method extracts key features with low training complexity.
Moreover, the experimental results show that the proposed technique improves recognition stability and accuracy, achieving an overall accuracy of 99.28%. Finally, the proposed method maintains high recognition accuracy even in noisy environments.
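The abstract does not give implementation details, so the following is only a minimal illustrative sketch of the general idea it describes: first-order importance features aggregated along horizontal (row) and longitudinal (column) directions of a feature map, followed by a self-attention pass that reweights them. All function names, dimensions, and the random projection weights are assumptions for illustration, not the authors' actual PPCSSA formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def first_order_features(fmap):
    """Aggregate a (H, W) feature map into point-to-point importance features.

    Each row (horizontal direction) and each column (longitudinal direction)
    is summarized by a softmax-weighted sum of its own points, a simple stand-in
    for extracting 'key horizontal and longitudinal point-to-point features'.
    """
    row_feat = (softmax(fmap, axis=1) * fmap).sum(axis=1)  # (H,)
    col_feat = (softmax(fmap, axis=0) * fmap).sum(axis=0)  # (W,)
    return np.concatenate([row_feat, col_feat])            # (H + W,)

def second_order_attention(feat, d=16, seed=0):
    """Toy single-head self-attention over the first-order feature sequence.

    Treats each scalar feature as a token; random Q/K/V projections are
    placeholders for learned weights in a real network.
    """
    rng = np.random.default_rng(seed)
    x = feat[:, None]                              # (n, 1) token embeddings
    Wq, Wk, Wv = (rng.standard_normal((1, d)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv               # (n, d) each
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (n, n) attention weights
    return attn @ v                                # (n, d) reweighted features

# Example on a small synthetic feature map:
fmap = np.arange(12, dtype=float).reshape(3, 4) / 10.0
first = first_order_features(fmap)       # shape (7,)
second = second_order_attention(first)   # shape (7, 16)
```

In this reading, the softmax weighting plays the role of suppressing redundant points so that only salient row/column responses survive into the second-order stage; the real method presumably learns these weights end to end.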