Urban greening plays a crucial role in maintaining environmental sustainability and enhancing people’s well-being. However, traditional methods struggle to capture the spatial heterogeneity and nonlinearity in the relationship between environmental factors and the green view index (GVI). To address nonlinearity, spatial heterogeneity, and interpretability together, this paper proposes an interpretable spatial machine learning framework that combines the Geographically Weighted Random Forest (GWRF) model with SHapley Additive exPlanations (SHAP). We combine multi-source big data, such as Baidu Street View images and remote sensing images, and use semantic segmentation models and geographic data processing techniques to produce global and local interpretations for the Beijing region, with GVI as the key indicator. The results show that: (1) within the Sixth Ring Road of Beijing, GVI exhibits significant spatial clustering and positive spatial correlation, together with significant spatial differences; (2) among the environmental variables, an increase in green coverage rate has the strongest positive effect on GVI, while an increase in building density is strongly negatively associated with GVI; (3) the GWRF model predicts GVI well and far outperforms the comparison models; (4) the influences of green coverage rate, the urban built environment, and socioeconomic factors on GVI are all nonlinear and exhibit threshold effects. These nonlinear influences and explicit thresholds provide a quantitative basis for greening analysis and can help urban planners make more scientific and rational decisions when allocating greening resources.
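
To make the key indicator concrete, the following is a minimal sketch of one common way to compute GVI from semantic segmentation output: the share of pixels labeled as vegetation in a street-view image, averaged over the views captured at a sampling point. The label names, the vegetation class set, and the function names `image_gvi` and `point_gvi` are illustrative assumptions and are not taken from the paper.

```python
from typing import Collection, Iterable, Sequence

# Hypothetical vegetation classes; the real set depends on the segmentation model used.
VEGETATION_LABELS = frozenset({"tree", "grass", "plant"})


def image_gvi(
    mask: Sequence[Sequence[str]],
    vegetation: Collection[str] = VEGETATION_LABELS,
) -> float:
    """Return the fraction of pixels in one segmentation mask labeled as vegetation."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("empty segmentation mask")
    green = sum(1 for row in mask for label in row if label in vegetation)
    return green / total


def point_gvi(masks: Iterable[Sequence[Sequence[str]]]) -> float:
    """Average the per-image GVI over all views taken at one sampling point."""
    values = [image_gvi(mask) for mask in masks]
    if not values:
        raise ValueError("no views supplied for this sampling point")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Two toy 2x2 masks standing in for two viewing directions at one point.
    north = [["tree", "road"], ["sky", "grass"]]
    south = [["building", "road"], ["sky", "road"]]
    print(point_gvi([north, south]))
```

Averaging over several viewing directions is one simple aggregation choice; other weightings are possible and would change only `point_gvi`.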