Hardness recognition of objects can improve the grasping ability of robots. Robots equipped with this capability can obtain useful information in complex environments, with wide applications in industrial settings and special-material measurement. Existing hardness-recognition networks usually consider only tactile or visual information, so they cannot be applied in the many settings where an object either cannot be viewed directly or where pressure data cannot be collected. To address this issue, this paper proposes a multi-time-scale convolutional neural network based on visual-tactile fusion (MTSCNN-VTF). The network consists of two parts, a Multi-Time-Scale ResNet Network (MTSRN) and a Separable Multi-Time-Scale ResNet Network (SMTSRN), which extract spatiotemporal feature vectors from the two modalities; the fused feature vector is then used by MTSCNN-VTF to identify the hardness level of an object. An ablation study on the STAG dataset verifies the rationality of our design, and a comprehensive comparison with other advanced methods demonstrates the effectiveness of the MTSCNN-VTF model.
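The fusion scheme described above can be sketched at a high level: each branch reduces its input signal to a feature vector computed at several time scales, the two vectors are concatenated, and a classifier maps the fused vector to a hardness level. The snippet below is a minimal stdlib-only illustration of that pipeline, not the paper's actual networks; the pooling stand-in for the ResNet branches, the scale set `(1, 2, 4)`, the class count, and the linear classifier are all assumptions made for clarity.

```python
import random

random.seed(0)

NUM_CLASSES = 4  # assumed number of hardness levels


def extract_features(signal, scales=(1, 2, 4)):
    # Stand-in for one branch (MTSRN or SMTSRN): summarize the signal
    # at several time scales by windowed averaging, keeping one
    # coarse "activation" value per scale.
    feats = []
    for s in scales:
        windows = [signal[i:i + s] for i in range(0, len(signal), s)]
        means = [sum(w) / len(w) for w in windows]
        feats.append(max(means))
    return feats


def fuse_and_classify(visual_feats, tactile_feats, weights):
    # Visual-tactile fusion by concatenation, followed by a toy
    # linear classifier over the fused feature vector.
    fused = visual_feats + tactile_feats
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    return scores.index(max(scores))  # predicted hardness level


# Usage with synthetic signals and random classifier weights.
visual_signal = [random.random() for _ in range(16)]
tactile_signal = [random.random() for _ in range(16)]
v = extract_features(visual_signal)
t = extract_features(tactile_signal)
W = [[random.random() for _ in range(len(v) + len(t))]
     for _ in range(NUM_CLASSES)]
prediction = fuse_and_classify(v, t, W)
```

In the paper's model the two branches are convolutional ResNet variants and the fusion feeds a learned classifier; this sketch only mirrors the data flow of extract-per-modality, fuse, then classify.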