Background
High relapse rates remain a clinical challenge in the management of breast cancer (BC), with distant recurrence being a major driver of patient deterioration. To optimize the surveillance regimen for distant recurrence after neoadjuvant chemotherapy (NAC), we conducted a comprehensive analysis using bioinformatics and machine learning approaches.

Materials and methods
Microarray data were retrieved from the GEO database, and differential expression analysis was performed with the R package ‘limma’. We used the Metascape tool for enrichment analyses and the R package ‘WGCNA’ to build co-expression networks, selecting the soft-threshold power with the ‘pickSoftThreshold’ function. We integrated ten machine learning algorithms and 101 algorithm combinations to identify key genes associated with distant recurrence in BC. Unsupervised clustering was performed with the R package ‘ConsensusClusterPlus’. To further screen the key gene signature of residual cancer burden (RCB), multiple knockdown studies were analyzed with the Genetic Perturbation Similarity Analysis (GPSA) database. Single-cell RNA sequencing (scRNA-seq) analysis was conducted through the Tumour Immune Single-cell Hub (TISCH) database, and the XSum algorithm was used to screen candidate small-molecule drugs based on the Connectivity Map (CMAP) database. Molecular docking was performed with Schrödinger software. GMT files containing gene sets associated with metabolism and senescence were obtained from the GSEA Molecular Signatures Database (MSigDB). The enrichment score of each gene set in each sample was computed with the ssGSEA method implemented in the ‘GSVA’ package.

Results
Our analysis, which combined limma, WGCNA, and machine learning approaches, identified 16 RCB-relevant gene signatures influencing distant recurrence-free survival (DRFS) in BC patients following NAC. Using GPSA, we then identified GATA3 as the key gene signature of a high RCB index.
A novel molecular subtyping scheme was developed to divide patients into two clusters (C1 and C2) with different distant recurrence risks. This scheme was closely associated with tumor metabolism and cellular senescence. Patients in cluster C2 had poorer DRFS than those in cluster C1 (HR: 4.04; 95% CI: 2.60–6.29; log-rank test p < 0.0001). High GATA3 expression, high levels of resting mast cell infiltration, and a high proportion of estrogen receptor (ER)-positive patients contributed to better DRFS in cluster C1. We established a nomogram based on N stage, RCB class, and molecular subtype. The ROC curve for 5-year DRFS showed excellent predictive value (AUC = 0.91; 95% CI: 0.86–0.95), with a C-index of 0.85 (95% CI: 0.81–0.90). Entinostat was identified as a candidate small-molecule compound for reversing high RCB after NAC. We also reviewed the endocrine-disrupting chemical (EDC) exposures that may affect the effectiveness of NAC in BC patients.

Conclusion
This study established a molecular classification scheme associated with tumor metabolism and cancer cell senescence to predict RCB and DRFS in BC patients after NAC. Furthermore, GATA3 was identified and validated as a key gene associated with BC recurrence.
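
The C-index reported for the nomogram measures how well a predicted risk score orders patients by time to distant recurrence. The following is a minimal sketch, assuming Harrell's pairwise definition and a simplified handling of tied follow-up times. The function name, variable names, and toy values are illustrative assumptions and are not taken from the study or its code.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times       -- follow-up time per patient
    events      -- 1 if distant recurrence was observed, 0 if censored
    risk_scores -- predicted risk (higher means earlier expected recurrence)

    Simplification (assumption): pairs with identical follow-up times
    are skipped rather than handled with a tie rule.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time ended in an event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk recurred first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


# Toy data (hypothetical): months of follow-up, event flags, predicted risks.
toy_times = [12, 30, 45, 60, 60]
toy_events = [1, 1, 0, 1, 0]
toy_risks = [0.9, 0.3, 0.4, 0.5, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. In the toy data, 5 of the 7 comparable pairs are concordant.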