Predicting urban-scale carbon emissions (CEs) is crucial for addressing urgent environmental issues, including global warming. However, prior studies have overlooked the micro-level street environment, which may bias predictions. To fill this gap, we developed a machine learning (ML) framework that predicts neighborhood-level residential CEs from a single data source, street view images (SVIs), which are publicly available worldwide. Specifically, we used semantic segmentation to classify more than 30 streetscape elements from SVIs and thereby describe the micro-level street environment, whose visual features can indicate the major socioeconomic activities that significantly affect residential CEs. ML models were trained and evaluated with ten-fold cross-validation to predict residential CEs at the 1 km grid level. We found, first, that random forest (R² = 0.8) outperforms many traditional models, confirming that visual features are non-negligible in explaining CEs. Second, more building, wall, and fence views are associated with higher CEs. Third, the presence of trees and grass is inversely related to CEs. Our findings support the feasibility of using SVIs as a single data source to predict neighborhood-level residential CEs. The framework is applicable to large regions with diverse urban forms and can inform urban planners about sustainable urban form strategies for achieving carbon-neutral goals, especially in the development of new towns.
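The pipeline summarized above (segmentation labels pooled into per-grid view shares, then ten-fold cross-validated prediction scored by R²) can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, the segmentation output is assumed to be a 2D grid of class names per image, and `fit_line` is a deliberately simple one-feature least-squares stand-in for the random forest, used only to keep the example dependency-free.

```python
import random
from collections import Counter


def view_ratios(label_maps, classes):
    """Share of pixels per class, pooled over all SVIs in one grid cell.

    label_maps: list of 2D lists of class names (assumed segmentation output).
    """
    counts = Counter()
    for label_map in label_maps:
        for row in label_map:
            counts.update(row)
    total = sum(counts.values())
    return [counts[c] / total for c in classes]


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(X, y):
    """Stand-in model (NOT a random forest): least squares on feature 0."""
    n = len(X)
    mx = sum(x[0] for x in X) / n
    my = sum(y) / n
    sxx = sum((x[0] - mx) ** 2 for x in X)
    sxy = sum((x[0] - mx) * (t - my) for x, t in zip(X, y))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)


def cross_val_predict(X, y, fit, k=10, seed=0):
    """Predict each sample with a model trained on the other k-1 folds."""
    preds = [None] * len(y)
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            preds[i] = model(X[i])
    return preds


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

In practice, each 1 km grid cell would contribute one feature vector from `view_ratios` and one CE value, and `fit` would wrap an actual random forest (for example, scikit-learn's `RandomForestRegressor`) rather than the stand-in. Scoring out-of-fold predictions, as `cross_val_predict` does, keeps the reported R² from being inflated by samples the model has already seen.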