Bone age assessment (BAA) based on X-ray imaging of the left hand and wrist can accurately reflect the body's degree of physiological development and physical condition. However, traditional manual evaluation depends heavily on time-consuming specialist labor. In this paper, to automate BAA, we introduce a hierarchical convolutional neural network that detects the regions of interest (ROIs) and classifies the bone grade of each. First, we establish a children's BAA dataset containing 2518 left-hand X-rays. Then, we locate the ROIs via object detection and apply fine-grained classification to obtain the grade of each ROI. Specifically, the fine-grained classifiers are based on context-aware attention pooling (CAP). Finally, we compute the bone age from the predicted grades using the third version of the Tanner–Whitehouse (TW3) method. The end-to-end BAA system outputs the bone age value, the detection results for 13 ROIs, and the bone maturity of each ROI, giving doctors convenient access to the information they need in practice. Experimental results on a public dataset and a clinical dataset show that the proposed method is competitive. The accuracy of bone grading is 86.93%, and the mean absolute error (MAE) of bone age is 7.68 months on the clinical dataset. On the public dataset, the MAE is 6.53 months. The proposed method achieves good performance in bone age assessment and outperforms existing fine-grained image classification methods.
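The two metrics reported above, grading accuracy and the MAE of bone age in months, can be illustrated with a minimal sketch. The function names and sample values below are hypothetical and chosen only for illustration; they are not taken from the paper's code or data.

```python
from typing import Sequence


def mean_absolute_error_months(predicted: Sequence[float],
                               reference: Sequence[float]) -> float:
    """Mean absolute difference between predicted and reference bone ages (months)."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def grading_accuracy(predicted_grades: Sequence[str],
                     reference_grades: Sequence[str]) -> float:
    """Fraction of ROIs whose predicted maturity grade matches the reference grade."""
    if len(predicted_grades) != len(reference_grades) or len(predicted_grades) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted_grades, reference_grades))
    return correct / len(predicted_grades)


# Illustrative values only (assumed, not results from the paper).
example_mae = mean_absolute_error_months([120.0, 96.5, 150.0],
                                         [114.0, 100.5, 158.0])
example_acc = grading_accuracy(["E", "F", "G", "H"],
                               ["E", "F", "H", "H"])
```

Here `example_mae` averages the absolute errors of 6, 4, and 8 months, and `example_acc` counts three matching grades out of four.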