The rampant spread of sexually explicit content across social media can damage society, so detecting and managing such material is essential to curbing its dissemination and safeguarding digital communities. In this article, we propose a technique, the attention‐enabled pooling (ABP) embedded Swin transformer‐based YOLOv3 (ASYv3), that detects obscene regions in images and marks each with a bounding box. ASYv3 takes a two‐step approach to improve detection performance. In the first step, a scalable and efficient Swin transformer block is integrated, using self‐attention and model parallelism to train massive models effectively. In the second step, the embedding layer of the Swin transformer is replaced with ABP, which mitigates disruption of feature context. ABP projects raw‐valued features into linear form while attending to feature context information at specified locations, which improves feature extraction. ASYv3 was trained on the annotated obscene images (AOI) dataset. It surpassed state‐of‐the‐art methods, achieving 97% testing accuracy, 96.62% precision, 97.40% sensitivity, a 3.48% false positive rate (FPR), a 97.37% negative predictive value (NPV), and a 95.59% mean average precision (mAP).
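For readers less familiar with the reported metrics, the following minimal sketch shows how accuracy, precision, sensitivity, FPR, and NPV are conventionally derived from confusion-matrix counts. The names `ConfusionCounts` and `detection_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the article or its data. mAP is omitted because it depends on ranked detections and IoU thresholds rather than on a single confusion matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical helper type)."""
    tp: int  # obscene samples correctly flagged
    fp: int  # benign samples wrongly flagged
    tn: int  # benign samples correctly passed
    fn: int  # obscene samples missed


def detection_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard definitions; assumes every denominator is non-zero."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": c.tp / (c.tp + c.fp),
        "sensitivity": c.tp / (c.tp + c.fn),  # also called recall
        "fpr": c.fp / (c.fp + c.tn),          # false positive rate
        "npv": c.tn / (c.tn + c.fn),          # negative predictive value
    }


# Illustrative counts only; these are NOT the article's results.
example = ConfusionCounts(tp=90, fp=10, tn=85, fn=15)
metrics = detection_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that FPR and specificity sum to one, so a low FPR together with high sensitivity indicates that the detector rarely flags benign images while still catching most obscene ones.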