With the rapid development of Internet of Things (IoT) technology, image data on the Internet are growing at a remarkable rate, and describing the semantic content of such massive image data poses a significant challenge. Attention mechanisms originate in the study of human vision: in cognitive science, because of bottlenecks in information processing, humans selectively attend to a portion of the available information while ignoring the rest. This study discusses an attention-based method for generating natural language descriptions of images in intelligent IoT systems. A CMOS image sensor (CIS) based on IoT technology is used for image data acquisition and display: an FPGA samples the 16-bit parallel-port data from the CIS, buffers it in a FIFO, stores the image data, and then transmits it to a host computer for display over a network interface. When sentence descriptions are generated with the encoder-decoder framework, maximum-likelihood estimation is used to maximize the joint probability of the word sequence under the language model, which is equivalent to minimizing the cross-entropy loss. At each time step, text features are input alongside the image features, and the attention mechanism computes a weighted sum of the image feature vector and the text feature vector. During decoding, the attention mechanism assigns a weight to each image region feature and a long short-term memory (LSTM) network decodes step by step; because a unidirectional LSTM has limited decoding capacity, we replace it with a bidirectional LSTM (BiLSTM) that dynamically attends to context information through its forward and backward passes. The specificity of the proposed network is 5% higher than that of the 3D convolutional residual network. The results show that feeding both the image context and the text context into the LSTM decoder improves the performance of the image description model.
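As a concrete illustration of the decoding scheme described above, the sketch below shows how an attention mechanism can weight the image region features at every time step, fuse the attended image context with the text feature, and decode the fused sequence with a bidirectional LSTM trained by maximum-likelihood (cross-entropy) estimation. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the additive attention form, the class name `AttentiveBiLSTMDecoder`, and all dimensions (`embed_dim`, `feat_dim`, `hidden_dim`, the number of regions) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveBiLSTMDecoder(nn.Module):
    """Attention weights each image region at every time step; the attended
    image context is fused with the text feature and decoded by a BiLSTM.
    All names and dimensions are illustrative assumptions."""

    def __init__(self, vocab_size, embed_dim=256, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention: query = text feature, keys = region features.
        self.q_proj = nn.Linear(embed_dim, hidden_dim)
        self.k_proj = nn.Linear(feat_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        # Forward and backward LSTM passes over the fused image + text input.
        self.bilstm = nn.LSTM(embed_dim + feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, regions, words):
        # regions: (B, R, feat_dim) image region features from the encoder
        # words:   (B, T) token ids of the (shifted) target caption
        emb = self.embed(words)                               # (B, T, E)
        # Attention weights over the R regions, computed per time step.
        q = self.q_proj(emb).unsqueeze(2)                     # (B, T, 1, H)
        k = self.k_proj(regions).unsqueeze(1)                 # (B, 1, R, H)
        scores = self.score(torch.tanh(q + k)).squeeze(-1)    # (B, T, R)
        alpha = F.softmax(scores, dim=-1)                     # region weights
        # Weighted sum of region features = image context per time step.
        context = torch.bmm(alpha, regions)                   # (B, T, feat_dim)
        # Fuse image context with the text feature, decode with the BiLSTM.
        h, _ = self.bilstm(torch.cat([emb, context], dim=-1)) # (B, T, 2H)
        return self.out(h)                                    # (B, T, vocab)

# MLE training: minimizing cross-entropy over next-token predictions
# maximizes the joint probability of the caption's word sequence.
decoder = AttentiveBiLSTMDecoder(vocab_size=10000)
regions = torch.randn(4, 36, 512)               # e.g. 36 regions per image
captions = torch.randint(0, 10000, (4, 20))     # dummy caption token ids
logits = decoder(regions, captions[:, :-1])     # predict each next token
loss = F.cross_entropy(logits.reshape(-1, 10000), captions[:, 1:].reshape(-1))
```

Here the attention query is the text feature at each step, so the weighted sum of image and text information is recomputed at every time step, and teacher forcing makes the full target sequence available so the backward LSTM pass is well defined during training.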