The built environment reshapes various scenes that can be perceived, experienced, and interpreted; these are known as city images. A city image emerges as a complex composite of various imagery elements. Previous studies demonstrated that the city images produced by experts with prior knowledge coincide with those extracted from the high-frequency content of photos generated by citizens. The realistic city images hidden behind volunteered geo-tagged photos, however, are more complex than assumed. The dominant elements are only one side of the city image; more importantly, the interactions between elements are also crucial for understanding how city images are structured in people's minds. This paper focuses on the composition of the city image: the various interactions between imagery elements and areas of a city. These interactions are characterized by four aspects: co-presence, hierarchy, heterogeneity, and differentiation, which are quantified and visualized as a correlation network, a dendrogram, spatial clusters, and scattergrams, respectively, in a framework that applies scene recognition to volunteered, georeferenced photos. The outputs are interdependent elements, typologies of elements, imagery areas, and group preferences, which are essential for urban design processes. In the application to Central Beijing, the significant interdependencies between pairs of elements are complex and are not necessarily confined to the elements with higher frequency. The main typologies and the principal imagery elements differ from those predefined in the image recognition model. The imagery areas detected with adaptive thresholds suggest spatially varying spillover effects of named areas, and their typologies can be well annotated by the detected principal imagery elements. Aggregating data from different social media platforms proves necessary for calibrating an unbiased scope of the city image. 
A single data source can hardly capture the whole sample. The differentiation between locals and non-locals is found to be related to their preferences and activity spaces. The results provide more comprehensive insights into the complex composition of city images and its effects on placemaking.
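
As an illustration of the co-presence aspect only, the following minimal sketch builds a correlation network between imagery elements. It is not the framework's actual implementation: the element names, the per-area photo counts, the use of Pearson correlation, and the 0.8 threshold are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt

# Hypothetical input: number of photos recognized as each imagery element
# in five spatial units (values invented for illustration).
counts = {
    "temple":  [12, 9, 1, 0, 7],
    "garden":  [10, 8, 2, 1, 6],
    "highway": [0, 1, 9, 11, 2],
}

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def copresence_edges(counts, threshold=0.8):
    """Return (element_a, element_b, r) for pairs with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(counts), 2):
        r = pearson(counts[a], counts[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges

edges = copresence_edges(counts)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r}")
```

Positive edges mark elements that tend to appear in the same areas, and negative edges mark elements that tend to exclude each other. In this toy data a strongly linked pair need not involve the most frequent element, which mirrors the point that interdependency is not only a matter of frequency.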