The paper introduces PerceptGuide, a novel wearable aid that helps visually impaired individuals perceive the scene around them. It is designed as a lightweight, wearable chest-rig bag that incorporates a monocular camera, ultrasonic sensors, vibration motors, and a mono earphone, powered by an embedded NVIDIA Jetson development board. The system provides directional obstacle alerts through the vibration motors, allowing users to avoid obstacles in their path. A user-friendly pushbutton lets the user request information about the scene in front of them. The scene details are conveyed through a novel scene-understanding approach that combines multi-scale feature fusion, self-attention models, and a multilayer GRU (Gated Recurrent Unit) architecture on a ResNet50 backbone. The proposed system generates coherent and descriptive captions by capturing image features at different scales, enhancing the quality and contextual understanding of the scene details. Self-attention in both the encoder (ResNet50 + feature-fusion module) and the decoder (multilayer GRU) effectively captures long-range dependencies and attends to relevant image regions. Quantitative evaluations on the MSCOCO and Flickr8k datasets show the effectiveness of the model, with improved scores of BLEU 67.7, ROUGE-L 47.6, METEOR 22.7, and CIDEr 67.4. The PerceptGuide system exhibits exceptional real-time performance, generating audible captions in just 1.5 to 2 seconds; this rapid response significantly aids visually impaired individuals in understanding the scenes around them. The qualitative evaluation of the aid emphasizes its real-time performance, demonstrating the generation of context-aware, semantically meaningful captions. This validates its potential as a wearable assistive aid for visually impaired people, with the added advantages of low power consumption, compactness, and a lightweight design.
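
The following is a minimal PyTorch sketch of the captioning pipeline described above: a ResNet50 encoder with multi-scale feature fusion and self-attention, followed by a multilayer GRU decoder that attends to the fused image regions. Layer sizes, the fusion strategy (1x1 projections plus element-wise addition), and the attention wiring are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of an encoder-decoder captioner: ResNet50 + multi-scale fusion +
# self-attention (encoder) and a multilayer GRU with region attention (decoder).
# All hyperparameters below are assumptions made for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """ResNet50 backbone; fuses features from two scales and applies self-attention."""

    def __init__(self, d_model=512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pretrained weights would be used in practice
        self.stem = nn.Sequential(*list(backbone.children())[:6])  # conv1 ... layer2 (512 ch)
        self.layer3 = backbone.layer3                               # layer3 (1024 ch)
        self.proj_c3 = nn.Conv2d(512, d_model, kernel_size=1)
        self.proj_c4 = nn.Conv2d(1024, d_model, kernel_size=1)
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, images):                      # images: (B, 3, H, W)
        c3 = self.stem(images)                      # (B, 512, H/8, W/8)
        c4 = self.layer3(c3)                        # (B, 1024, H/16, W/16)
        # Multi-scale fusion: project both maps to d_model, upsample the coarse map, add.
        f3 = self.proj_c3(c3)
        f4 = nn.functional.interpolate(self.proj_c4(c4), size=f3.shape[-2:], mode="nearest")
        fused = f3 + f4                             # (B, d_model, h, w)
        regions = fused.flatten(2).transpose(1, 2)  # (B, h*w, d_model) region features
        attended, _ = self.self_attn(regions, regions, regions)
        return attended


class Decoder(nn.Module):
    """Multilayer GRU decoder; each word embedding attends to the encoder regions."""

    def __init__(self, vocab_size, d_model=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.gru = nn.GRU(2 * d_model, d_model, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, regions, captions, hidden=None):   # captions: (B, T) token ids
        emb = self.embed(captions)                        # (B, T, d_model)
        ctx, _ = self.attn(emb, regions, regions)         # attend to relevant image regions
        rnn_in = torch.cat([emb, ctx], dim=-1)
        out, hidden = self.gru(rnn_in, hidden)
        return self.out(out), hidden                      # logits: (B, T, vocab_size)


if __name__ == "__main__":
    enc, dec = Encoder(), Decoder(vocab_size=10000)
    imgs = torch.randn(2, 3, 224, 224)
    toks = torch.randint(0, 10000, (2, 12))
    logits, _ = dec(enc(imgs), toks)
    print(logits.shape)  # torch.Size([2, 12, 10000])
```

In deployment, a model of this kind would be run once per pushbutton press, decoding tokens greedily (or with beam search) and passing the resulting caption to a text-to-speech engine on the Jetson board.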