Object pose estimation in computer vision relies heavily on both color (RGB) and depth (D) images: color supplies appearance information and depth supplies geometric information, which together help algorithms reason about occlusions and object geometry and thereby improve accuracy. However, capturing depth requires specialized sensors, which raises cost and limits availability. Consequently, researchers are exploring methods that estimate object poses from RGB images alone. Without depth, these methods have difficulty handling occlusions, discerning object geometry, and resolving ambiguities that arise from similar color or texture patterns. This paper introduces a novel geometry-aware method for object pose estimation that takes RGB images as input and determines the poses of multiple object instances. Our approach uses both depth and color images during training but relies only on color images during inference. Rather than reading depth from a physical sensor, our method estimates a depth image from the RGB input and computes a predicted point cloud from that estimate. A key contribution is a multi-scale fusion module that integrates features extracted from the RGB images with features extracted from the predicted point clouds. By combining the strengths of both modalities, this fusion improves the accuracy of the estimated object poses. Extensive experiments show that our approach markedly outperforms state-of-the-art RGB-based methods on the Occluded-LINEMOD and YCB-Video datasets. Moreover, our method achieves results competitive with RGB-D approaches that require both RGB and depth data from physical sensors.
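To make the step from an estimated depth image to a predicted point cloud concrete, the following is a minimal sketch of standard pinhole back-projection. The abstract does not state how the point cloud is computed, so the pinhole camera model, the function name `depth_to_point_cloud`, and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) camera-frame point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels. These names are illustrative only.
    """
    h, w = depth.shape
    # Pixel grids: u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Keep only pixels with a finite, positive depth value.
    valid = np.isfinite(points[:, 2]) & (points[:, 2] > 0)
    return points[valid]
```

In a pipeline like the one described, `depth` would be the output of a depth estimator applied to the RGB image rather than a sensor reading, and the resulting points would be passed to a point-cloud feature extractor before fusion with the RGB features.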