Cervical cancer is one of the most common and deadliest cancers among women and poses a serious health risk. Automated screening and diagnosis of cervical cancer will help improve the accuracy of cervical cell screening. In recent years, there have been many studies conducted using deep learning methods for automatic cervical cancer screening and diagnosis. Deep-learning-based Convolutional Neural Network (CNN) models require large amounts of data for training, but large cervical cell datasets with annotations are difficult to obtain. Some studies have used transfer learning approaches to handle this problem. However, such studies used the same transfer learning method that is the backbone network initialization by the ImageNet pre-trained model in two different types of tasks, the detection and classification of cervical cell/clumps. Considering the differences between detection and classification tasks, this study proposes the use of COCO pre-trained models when using deep learning methods for cervical cell/clumps detection tasks to better handle limited data set problem at training time. To further improve the model detection performance, based on transfer learning, we conducted multi-scale training according to the actual situation of the dataset. Considering the effect of bounding box loss on the precision of cervical cell/clumps detection, we analyzed the effects of different bounding box losses on the detection performance of the model and demonstrated that using a loss function consistent with the type of pre-trained model can help improve the model performance. We analyzed the effect of mean and std of different datasets on the performance of the model. It was demonstrated that the detection performance was optimal when using the mean and std of the cervical cell dataset used in the current study. Ultimately, based on backbone Resnet50, the mean Average Precision (mAP) of the network model is 61.6% and Average Recall (AR) is 87.7%. Compared to the current values of 48.8% and 64.0% in the used dataset, the model detection performance is significantly improved by 12.8% and 23.7%, respectively.