Trespassing endangers the security of individuals and property, disrupts social order, undermines social trust, and increases the social resources required to maintain order. This paper presents a method for combating trespassing that monitors human behavior in order to predict it. The method has two parts: image-based detection and text description. We investigate lightweight human behavior detection models based on YOLO-v5 and RNN. All models are trained and evaluated on the same dataset, and we compare their performance using several metrics (e.g., accuracy and running speed). For images and video, we apply a pruning algorithm to make the YOLO-v5 model lightweight while preserving accuracy. For text description, we use different image-captioning models (RNN and CLIP) to describe human behavior. Finally, we carry out validation experiments to verify the proposed method.
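As an illustration of the pruning idea mentioned above, the following is a minimal sketch of channel pruning by L1 norm, written in plain Python. The abstract does not state which pruning criterion is used, so the L1-norm ranking, the function names (`l1_norm`, `select_channels_to_keep`, `prune_layer`), and the `prune_ratio` parameter are all assumptions made for illustration and are not taken from the paper or from the YOLO-v5 code base.

```python
from typing import List


def l1_norm(channel: List[float]) -> float:
    """Sum of absolute weights, used here as a (hypothetical) importance score."""
    return sum(abs(w) for w in channel)


def select_channels_to_keep(channels: List[List[float]], prune_ratio: float) -> List[int]:
    """Return indices of channels that survive pruning, in their original order.

    Assumption: the channels with the smallest L1 norm are the least important.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_prune = int(len(channels) * prune_ratio)
    # Rank channel indices from least to most important.
    ranked = sorted(range(len(channels)), key=lambda i: l1_norm(channels[i]))
    pruned = set(ranked[:n_prune])
    return [i for i in range(len(channels)) if i not in pruned]


def prune_layer(channels: List[List[float]], prune_ratio: float) -> List[List[float]]:
    """Drop the lowest-scoring channels and return the remaining weights."""
    keep = select_channels_to_keep(channels, prune_ratio)
    return [channels[i] for i in keep]


if __name__ == "__main__":
    # Toy layer with four channels of two weights each.
    layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.5], [-0.03, 0.0]]
    print(select_channels_to_keep(layer, prune_ratio=0.5))
    print(prune_layer(layer, prune_ratio=0.5))
```

In practice a pruned detector would normally be fine-tuned afterwards and re-evaluated on the same dataset, so that accuracy and running speed can be compared before and after pruning.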