Manual harvesting of coconuts is a highly risky and skill-demanding operation, and the population of people involved in coconut tree climbing has been steadily decreasing. Hence, with the evolution of tree-climbing robots and robotic end-effectors, the development of autonomous coconut harvesters with the help of machine vision technologies is of great interest to farmers. However, coconuts are very hard and experience high occlusions on the tree. Hence, accurate detection of coconut clusters based on their occlusion condition is necessary to plan the motion of the robotic end-effector. This study proposes a deep learning-based object detection Faster Regional-Convolutional Neural Network (Faster R-CNN) model to detect coconut clusters as non-occluded and leaf-occluded bunches. To improve identification accuracy, an attention mechanism was introduced into the Faster R-CNN model. The image dataset was acquired from a commercial coconut plantation during daylight under natural lighting conditions using a handheld digital single-lens reflex camera. The proposed model was trained, validated, and tested on 900 manually acquired and augmented images of tree crowns under different illumination conditions, backgrounds, and coconut varieties. On the test dataset, the overall mean average precision (mAP) and weighted mean intersection over union (wmIoU) attained by the model were 0.886 and 0.827, respectively, with average precision for detecting non-occluded and leaf-occluded coconut clusters as 0.912 and 0.883, respectively. The encouraging results provide the base to develop a complete vision system to determine the harvesting strategy and locate the cutting position on the coconut cluster.