Street view big data is increasingly used to uncover the visual characteristics and spatial perception of urban streets. However, few studies have applied such data to perceptual evaluation and street quality improvement in underdeveloped ethnic areas. Focusing on the central city of Lhasa, Tibet, this study uses deep learning methods to build a human–computer confrontational model for perception scoring. Pearson correlation analysis was conducted between visual elements and perception data in six dimensions (beautiful, wealthy, safe, lively, boring, and depressing). Streets in the top 20% for both visual element proportions and perceptual scores were identified to locate areas where the two coincide, and the spatial distribution of visual elements and street perceptions, together with the correlation between them, was analyzed. The results show that the central city of Lhasa exhibited high percentages of visual elements in buildings (88.23%), vegetation (89.52%), and poles (3.14%). Among the six perceptions, boring (69.70) and depressing (67.76) scored highest, followed by beautiful (60.66) and wealthy (59.91), while lively (56.68) and safe (50.64) scored lowest. Roads (−0.094), sidewalks (−0.031), fences (−0.036), terrain (−0.020), sky (−0.098), cars (−0.016), and poles (−0.075) showed a significant negative correlation with the boring perception, whereas the other visual elements were positively correlated with it. These findings offer a reference for the design and improvement of urban streets in underdeveloped ethnic areas and help fill a gap in street perception research in such areas.
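
The two analysis steps named above, Pearson correlation between a visual element and a perception score, and selection of streets in the top 20% for both, can be illustrated with a minimal sketch. It is not the study's implementation: the function names `pearson` and `top_fraction`, the variable names, and the ten-street toy values are all hypothetical and are not taken from the Lhasa data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)

def top_fraction(values, fraction=0.2):
    """Indices of the streets whose value lies in the top `fraction`."""
    k = max(1, round(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

# Hypothetical toy data, one entry per street (NOT values from the study).
vegetation_share = [0.10, 0.25, 0.05, 0.30, 0.15, 0.20, 0.08, 0.12, 0.28, 0.18]
beautiful_score = [55.0, 64.0, 50.0, 70.0, 58.0, 61.0, 52.0, 56.0, 66.0, 60.0]

# Correlation between one visual element and one perception dimension.
r = pearson(vegetation_share, beautiful_score)

# Streets that are in the top 20% for BOTH the element and the perception.
hotspots = top_fraction(vegetation_share) & top_fraction(beautiful_score)

print(f"r = {r:.3f}, streets in both top-20% sets: {sorted(hotspots)}")
```

In a full analysis the same two functions would be applied to every pair of visual element and perception dimension; a significance test for each coefficient, which this sketch omits, would also be needed before calling a correlation significant.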