Semantic classification of urban scenes aims to classify scenes composed of many different types of objects into predefined semantic classes. Learning the association between urban scenes and semantic classes involves five tasks: 1) segmenting the image into scenes; 2) establishing semantic classes of scenes; 3) extracting and transforming features; 4) measuring intrascene feature similarity; and 5) labeling each scene with a semantic classification method. Despite many efforts on these tasks, most existing works consider only visual features with inconsistent similarity measurements, while ignoring the semantic features inside scenes and the interactions between scenes, leading to poor classification results for highly heterogeneous scenes. To address these problems, this study combines intrascene feature similarity and interscene semantic dependency in a two-step classification approach. In the first step, visual and semantic features are optimized to be invariant to affine transformations and are then used by a K-nearest neighbor (KNN) classifier to produce an initial classification of the scenes. In the second step, a multinomial distribution is introduced to model both the spatial and semantic dependencies between scenes and is used to refine the initial classification. Experiments conducted in two study areas indicate that the proposed approach produces better results for heterogeneous scenes than visual interpretation, as it can discover and model the hidden information between scenes that is often ignored by existing methods. In addition, compared with the initial classification, the refinement step improves accuracy by 3.6% and 5% in the two study areas, respectively.
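
To make the two-step pipeline concrete, the following is a minimal sketch of the overall flow, not the paper's exact model: it assumes precomputed affine-invariant feature vectors (one row per scene), a scene adjacency structure given as lists of neighbor indices, and a simple multinomial voting scheme with a Dirichlet-style prior as an illustrative stand-in for the paper's spatial/semantic dependency model. All function names and parameters here are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def initial_knn_labels(features, train_features, train_labels, k=5):
    """Step 1: assign each scene an initial semantic class with KNN.

    `features` are assumed to be precomputed affine-invariant visual +
    semantic descriptors, one row per scene (hypothetical representation).
    """
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_features, train_labels)
    return knn.predict(features)


def refine_labels(labels, adjacency, n_classes, alpha=1.0, n_iter=3):
    """Step 2: refine labels using interscene dependency.

    For each scene, build a multinomial count vector over the classes of
    its spatial neighbors (`adjacency[i]` lists the neighbor indices of
    scene i) and relabel the scene with the most probable class. This
    iterative majority scheme is only an illustration of exploiting
    spatial/semantic dependency between scenes.
    """
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        new_labels = labels.copy()
        for i, neighbors in enumerate(adjacency):
            counts = np.full(n_classes, alpha)      # Dirichlet-style prior
            counts[labels[i]] += 1.0                # keep some self-evidence
            for j in neighbors:
                counts[labels[j]] += 1.0            # neighbor votes
            new_labels[i] = int(np.argmax(counts))  # MAP class under multinomial
        labels = new_labels
    return labels
```

In this sketch, the refinement step only sharpens the initial KNN labels where a scene disagrees with most of its neighbors, which mirrors the abstract's claim that modeling interscene dependency improves on the purely feature-based initial classification.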