This article endeavors to enhance image recognition technology within the context of the Internet of Things (IoT). A dynamic image target detection training model is established through the convolutional neural network (CNN) algorithm within the framework of deep learning (DL). Three distinct model configurations are proposed: a nine-layer convolution model, a seven-layer convolution model, and a residual module convolution model. Subsequently, the simulation model of CNN image target detection based on optical imaging is constructed, and the simulation experiments are conducted in scenarios of simple and salient environments, complex and salient environments, and intricate micro-environment. By determining the optimal training iterations, comparisons are drawn in terms of precision, accuracy, Intersection Over Union (IoU), and frames per second (FPS) among different model configurations. Finally, an attention mechanism is incorporated within the DL framework, leading to the construction of an attention mechanism CNN target detection model that operates at three difficulty levels: simple, intermediate, and challenging. Through comparative analysis against prevalent target detection algorithms, this article delves into the accuracy and detection efficiency of various models for IoT target detection. Key findings include: (1) The seven-layer CNN model exhibits commendable accuracy and confidence in simple and salient environments, although it encounters certain instances of undetected images, indicating scope for improvement. (2) The residual network model, when employing a loss function comprising both mean square error (MSE) and cross entropy, demonstrates superior performance in complex and salient environments, manifesting high precision, IoU, and accuracy metrics, thereby establishing itself as a robust detection model. (3) Within intricate micro-environments, the residual CNN model, utilizing loss functions of MSE and cross entropy, yields substantial results, with precision, IoU, and FPS values amounting to 0.99, 0.83, and 29.9, respectively. (4) The CNN model enriched with an attention mechanism outperforms other models in IoT target image detection, achieving the highest accuracy rates of 24.86%, 17.8%, and 14.77% in the simple, intermediate, and challenging levels, respectively. Although this model entails slightly longer detection times, its overall detection performance is excellent, augmenting the effectiveness of object detection within IoT. This article strives to enhance image target detection accuracy and speed, bolster the recognition capability of IoT systems, and refine dynamic image target detection within IoT settings. The implications encompass reduced manual recognition costs and the provision of a theoretical foundation for optimizing imaging and image target detection technologies in the IoT context.