Introduction: Rheumatoid arthritis (RA) is a systemic autoimmune disease for which early diagnosis and treatment are crucial. The modified total Sharp score (mTSS) is widely used to assess RA, but the standard screening process for mTSS is tedious and time-consuming, so an efficient system that automatically localizes joints and classifies their mTSS grade is urgently needed for RA diagnosis. Current research mostly focuses on the classification of finger joints; because their detection of the carpal region is insufficient, these methods cannot cover all the joints that mTSS requires.
Method: We propose an automatic labeling system that uses the You Only Look Once (YOLO) model to detect the joint regions of both hands in hand X-ray images, as preprocessing for joint space narrowing scoring in mTSS, together with a classification model that grades each joint by mTSS severity. In the image preprocessing, window-level adjustment is applied to simulate how clinicians view radiographs; training data for the carpal and finger bones are evaluated both separately and as an integrated set, and image resolution is increased or decreased to observe its effect on model accuracy.
Results: Integrating the data proved beneficial. The proposed model reached a mean average precision of 0.92 for joint detection in joint space narrowing, and its precision, recall, and F1 score all fell between 0.94 and 0.95. For joint classification, the average accuracy was 0.88; the accuracies for the severe, mild, and healthy classes reached 0.91, 0.79, and 0.90, respectively.
Conclusions: The proposed model is feasible and efficient and could support subsequent research on computer-aided diagnosis in RA. For clinical practice, we suggest that applying a one-hand X-ray imaging protocol can improve the accuracy of the mTSS classification model in identifying mild disease.
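The window-level preprocessing mentioned in the Method can be sketched as below. This is a minimal illustration of the standard radiology window/level transform, not the authors' implementation; the function name, the window center/width values, and the toy pixel array are all assumptions chosen for the example.

```python
import numpy as np

def apply_window(image, center, width):
    """Clip raw detector intensities to the window [center - width/2,
    center + width/2] and rescale to 8-bit grayscale, mimicking how a
    clinician adjusts window level to inspect bone on an X-ray."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    windowed = np.clip(image.astype(np.float64), lo, hi)
    return ((windowed - lo) / (hi - lo) * 255.0).astype(np.uint8)

# Toy 16-bit X-ray-like array (values are illustrative only)
raw = np.array([[100, 2000],
                [3000, 5000]], dtype=np.uint16)
display = apply_window(raw, center=2500, width=3000)  # window spans 1000..4000
```

Values below the window floor map to 0 and values above the ceiling map to 255, so the chosen window concentrates the 8-bit dynamic range on the intensity band of interest (here, the bone) before the image is passed to the detector.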