Image captioning is a promising technique for remote monitoring of patient behavior, enabling healthcare providers to identify changes in patient routines and conditions. In this study, we explore the use of transformer neural networks for generating captions from surveillance camera footage captured at regular one-minute intervals. Our goal is to develop and evaluate a transformer neural network model, trained and tested on the COCO (Common Objects in Context) dataset, that generates captions describing patient behavior. We also compare the proposed approach with a traditional convolutional neural network (CNN) method to highlight its advantages. Our findings demonstrate the potential of transformer neural networks to generate natural language descriptions of patient behavior, which can provide valuable insights for healthcare providers. Such technology can make patient monitoring more efficient and enable timely interventions when necessary. Moreover, our study highlights the potential of transformer neural networks to identify patterns and trends in patient behavior over time, which can aid in developing personalized healthcare plans.