The last five to six years have seen tremendous progress in automatic image captioning using deep learning. While initial research focused on attribute-to-attribute comparison of image features and text to describe an image as a sentence, current research addresses issues related to semantics and correlations. However, current state-of-the-art research still handles positional and geometrical attributes inadequately. Most approaches rely on Convolutional Neural Networks (CNNs) for object feature extraction; CNNs, however, do not model equivariance or rotational invariance, which leads to an orientation-agnostic understanding of objects for captioning, longer training times, and the need for larger datasets. Furthermore, CNN-based image captioning encoders fail to capture the geometrical alignment of object attributes within an image and therefore mislabel distorted objects as correct. To address these issues, we propose ICPS (Image Captioning with Positional and Geometrical Semantics), a capsule-network-based image captioning technique that uses a Transformer neural network as the decoder. The proposed ICPS architecture handles various geometrical properties of image objects with the help of parallelized capsules, while the object-to-text decoding is performed by the Transformer. The inclusion of cluster capsules provides a better understanding of objects in terms of position, equivariance, and geometrical orientation, and achieves this over a smaller dataset in comparatively less time. The extracted image features give a richer representation of image objects and help the decoding stage produce captions that convey positional and geometrical details. We trained and tested ICPS on the Flickr8k dataset and found that it describes positional and geometrical transitions better than other current state-of-the-art approaches.
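To make the encoder-decoder pairing described above concrete, the following is a minimal sketch (not the authors' implementation) of a capsule-based image encoder feeding a standard Transformer decoder, written in PyTorch. All layer sizes, module names, and the omission of routing-by-agreement and positional encodings are assumptions made purely for illustration.

```python
# Minimal sketch of a capsule-encoder + Transformer-decoder captioner.
# Assumes PyTorch; all sizes and names are illustrative, not the ICPS code.
import torch
import torch.nn as nn


def squash(s, dim=-1, eps=1e-8):
    # Capsule nonlinearity: keeps the pose vector's direction and maps its
    # length into [0, 1) so that length can encode object presence.
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * (s / torch.sqrt(norm_sq + eps))


class CapsuleEncoder(nn.Module):
    """Conv front end followed by primary capsules whose pose vectors
    retain positional/orientation information (illustrative design)."""
    def __init__(self, caps_dim=16, n_caps=32, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv2d(64, n_caps * caps_dim, kernel_size=9, stride=2, padding=4),
        )
        self.caps_dim = caps_dim
        self.proj = nn.Linear(caps_dim, d_model)  # capsule pose -> decoder memory

    def forward(self, images):                    # images: (B, 3, H, W)
        x = self.conv(images)                     # (B, n_caps*caps_dim, h, w)
        caps = x.permute(0, 2, 3, 1).reshape(images.size(0), -1, self.caps_dim)
        caps = squash(caps)                       # one squashed pose vector per capsule
        return self.proj(caps)                    # (B, num_capsules, d_model)


class CaptionDecoder(nn.Module):
    """Standard Transformer decoder attending over the capsule features."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):            # tokens: (B, T) caption token ids
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(h)                        # (B, T, vocab) next-token logits


# Toy forward pass: encode an image batch, then decode caption logits.
encoder, decoder = CapsuleEncoder(), CaptionDecoder(vocab_size=10000)
memory = encoder(torch.randn(2, 3, 64, 64))
logits = decoder(torch.randint(0, 10000, (2, 12)), memory)
```

The sketch keeps the division of labor stated in the abstract: capsule pose vectors carry position and orientation information into the decoder's cross-attention, while caption generation itself is handled by an ordinary Transformer decoder.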