A hand gesture recognition system provides a robust and innovative solution for nonverbal communication through human–computer interaction. Deep learning models hold excellent potential for such recognition applications. To address the associated challenges, most previous studies have proposed new model architectures or have fine-tuned pre-trained models. Furthermore, these studies relied on a single standard dataset for both training and testing; as a result, the accuracies they report are reasonable. Unlike these works, the current study investigates two deep learning models with intermediate layers for recognizing static hand gesture images. Both models were tested on different datasets, adapted to suit each dataset, and then trained under different schemes. First, the models were initialized with random weights and trained from scratch. Second, the pre-trained models were examined as feature extractors. Finally, the pre-trained models were fine-tuned at their intermediate layers, with fine-tuning conducted at three depths: the fifth, fourth, and third blocks. The models were evaluated through recognition experiments on Arabic sign language hand gesture images acquired under different conditions. This study also provides a new hand gesture image dataset used in these experiments, alongside two other datasets. The experimental results indicated that the proposed models with intermediate layers can recognize hand gesture images. Furthermore, the analysis of the results showed that fine-tuning the fifth and fourth blocks of the two models achieved the best accuracy. In particular, for the first model, the testing accuracies on the three datasets were 96.51%, 72.65%, and 55.62% when fine-tuning the fourth block, and 96.50%, 67.03%, and 61.09% when fine-tuning the fifth block. The second model showed approximately similar testing accuracies.
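The block-wise fine-tuning strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the two models, so a generic VGG-style backbone with five convolutional blocks is assumed, and `GestureNet`, `conv_block`, and `fine_tune_from` are hypothetical names. Freezing all blocks below a chosen depth and leaving the rest trainable reproduces the idea of fine-tuning "at the fifth, fourth, or third block".

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """One VGG-style block: stacked 3x3 convolutions followed by max-pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class GestureNet(nn.Module):
    """Hypothetical five-block backbone (an assumption, not the paper's model)."""
    def __init__(self, num_classes=32):  # e.g. 32 Arabic sign language classes
        super().__init__()
        self.blocks = nn.ModuleList([
            conv_block(3, 64, 2),     # block 1
            conv_block(64, 128, 2),   # block 2
            conv_block(128, 256, 3),  # block 3
            conv_block(256, 512, 3),  # block 4
            conv_block(512, 512, 3),  # block 5
        ])
        # For 224x224 inputs, five 2x2 poolings leave a 7x7 feature map.
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(512 * 7 * 7, num_classes))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)

def fine_tune_from(model, first_trainable_block):
    """Freeze every block below `first_trainable_block` (1-indexed);
    the remaining blocks and the classifier head stay trainable.
    `first_trainable_block=5` fine-tunes only block 5; `=4` fine-tunes
    blocks 4-5; `=1` with random weights is training from scratch."""
    for i, block in enumerate(model.blocks, start=1):
        for p in block.parameters():
            p.requires_grad = i >= first_trainable_block
    return model

# Fine-tune blocks 4 and 5 while blocks 1-3 keep their pre-trained weights.
model = fine_tune_from(GestureNet(), first_trainable_block=4)
```

In this sketch, passing `first_trainable_block=6` would freeze all five blocks and train only the classifier head, which corresponds to using the pre-trained network purely as a feature extractor.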