Football is one of the most popular sports worldwide, capable of attracting millions of fans to a single match in the top leagues. The English Premier League, Spanish LaLiga, German Bundesliga, Italian Serie A, and French Ligue 1 are commonly regarded as the five best leagues in the world today. This study analyzes the efficiency and accuracy of player detection and tracking in football matches using the Mask R-CNN deep learning model. We applied Mask R-CNN to the segmentation and detection of football players, using two different backbones: ResNet101 and DenseNet. The two backbones produced accuracy values that were not significantly different, but in comprehensive experiments on the dataset, Mask R-CNN with DenseNet performed better than Mask R-CNN with ResNet101 on both the validation and testing sets. Because of an insufficient understanding of the characteristics of the image types and the uneven distribution of data sourced from random videos, there is still room for improvement in the trained model.
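To make the backbone comparison concrete, the following is a minimal sketch of how two segmentation outputs for the same player might be scored against a ground-truth mask. It assumes intersection over union (IoU) as the measure, which the abstract does not state; the function name `mask_iou` and the toy masks are hypothetical and do not come from the study's data or results.

```python
def mask_iou(pred, truth):
    """Intersection over union of two equally sized binary masks.

    Each mask is a list of rows holding 0 (background) or 1 (player).
    IoU is an assumed metric here, not one confirmed by the study.
    """
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU 1.0.
    return inter / union if union else 1.0


# Hypothetical 3x4 ground-truth mask for one player.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

# Hypothetical prediction from one backbone: misses one player pixel.
pred_a = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]

# Hypothetical prediction from another backbone: misses one pixel
# and also marks one background pixel as player.
pred_b = [
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]

iou_a = mask_iou(pred_a, truth)
iou_b = mask_iou(pred_b, truth)
print(f"backbone A IoU: {iou_a:.3f}")
print(f"backbone B IoU: {iou_b:.3f}")
```

Averaging such per-mask scores over the validation and testing sets would give one number per backbone, which is one plausible way a small but consistent difference between two backbones could be reported.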