In this paper, we present a novel methodology based on machine learning for identifying the most appropriate from a set of available state-of-the-art object detectors for a given application. Our particular interest is to develop a road map for identifying verifiably optimal selections, especially for challenging applications such as detecting small objects in a mixed-size object dataset. State-of-the-art object detection systems often find the localisation of small-size objects challenging since most are usually trained on large-size objects. These contain abundant information as they occupy a large number of pixels relative to the total image size. This fact is normally exploited by the model during training and inference processes. To dissect and understand this process, our approach systematically examines detectors’ performances using two very distinct deep convolutional networks. The first is the single-stage YOLO V3 and the second is the double-stage Faster R-CNN. Specifically, our proposed method explores and visually illustrates the impact of feature extraction layers, number of anchor boxes, data augmentation, etc., utilising ideas from the field of explainable Artificial Intelligence (XAI). Our results, for example, show that multi-head YOLO V3 detectors trained using augmented data produce better performance even with a fewer number of anchor boxes. Moreover, robustness regarding the detector’s ability to explain how a specific decision was reached is investigated using different explanation techniques. Finally, two new visualisation techniques are proposed, WS-Grad and Concat-Grad, for identifying explanation cues of different detectors. These are applied to specific object detection tasks to illustrate their reliability and transparency with respect to the decision process. It is shown that the proposed techniques can result in high resolution and comprehensive heatmaps of the image areas, significantly affecting detector decisions as compared to the state-of-the-art techniques tested.