The use of robots in various industries is evolving from mechanization toward intelligence and precision. These systems often involve parts made of different materials and therefore require accurate and comprehensive target identification. Humans perceive the world through a highly diverse perceptual system and can rapidly identify deformable objects through vision and touch, which prevents slipping or excessive deformation during grasping. Robot recognition technology, by contrast, relies mainly on visual sensors, which lack critical information such as object material, leading to incomplete cognition. Multimodal information fusion is therefore considered key to the development of robot recognition. In this work, a method for converting tactile sequences into images is first proposed to remove the obstacles to information exchange between the visual and tactile modalities and to address the noise and instability of tactile data. A visual-tactile fusion network framework based on an adaptive dropout algorithm is then constructed, and an optimal joint mechanism between visual and tactile information is established, to solve the problem of mutual exclusion or unbalanced fusion in traditional fusion methods. Finally, experiments show that the proposed method effectively improves robot recognition ability, achieving a classification accuracy of up to 99.3%.
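The abstract does not say how tactile sequences are converted to images, so the following is only a minimal sketch of one plausible encoding, not the authors' method. It assumes a tactile sequence is a list of T frames with C sensor channels each, and maps it to a C x T 8-bit grayscale image by min-max normalizing every channel independently. The function name `tactile_sequence_to_image` and the per-channel normalization are hypothetical choices made for illustration, and the sketch makes no attempt to handle sensor noise.

```python
from typing import List, Sequence


def tactile_sequence_to_image(sequence: Sequence[Sequence[float]]) -> List[List[int]]:
    """Map a T x C tactile sequence to a C x T grayscale image (values 0..255).

    Assumption (not from the paper): each sensor channel is min-max
    normalized on its own, so one image row corresponds to one channel
    and one image column corresponds to one time step.
    """
    if not sequence or not sequence[0]:
        raise ValueError("sequence must contain at least one non-empty frame")

    n_channels = len(sequence[0])
    if any(len(frame) != n_channels for frame in sequence):
        raise ValueError("all frames must have the same number of channels")

    image: List[List[int]] = []
    for c in range(n_channels):
        # Collect the readings of channel c over all time steps.
        channel = [frame[c] for frame in sequence]
        lo, hi = min(channel), max(channel)
        span = hi - lo
        if span == 0:
            # A constant channel carries no contrast; render it as black.
            row = [0] * len(channel)
        else:
            # Scale readings linearly into the 8-bit pixel range.
            row = [round(255 * (value - lo) / span) for value in channel]
        image.append(row)
    return image


# Hypothetical example: 3 time steps, 2 sensor channels.
example = [[0.0, 5.0], [1.0, 5.0], [4.0, 5.0]]
pixels = tactile_sequence_to_image(example)
```

An image of this kind has the same array layout as a camera frame, so in principle it can be passed to the same type of convolutional feature extractor as the visual input, which is the practical appeal of converting tactile data to images before fusion.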