Generating schematic maps of salient objects from a set of pictures of an indoor environment is a challenging problem. It has been an active area of research because it is crucial to a wide range of context- and location-aware services, as well as to general scene understanding. Although many automated systems have been developed to solve the problem, most of them require either predefined labels or expensive equipment, such as RGB-D sensors or lasers, to scan the environment. In this article, we introduce a prototype system that shows how human computation can be used to generate schematic maps from a set of pictures without making strong assumptions or requiring extra devices. The system asks humans (crowd workers from Amazon Mechanical Turk) to perform simple spatial mapping tasks under various conditions, and their data are aggregated by filtering and clustering techniques that allow salient cues to be identified in the pictures and their spatial relations to be inferred and projected onto a two-dimensional map. In particular, we tested and demonstrated the effectiveness of two methods that improved the quality of the generated schematic maps: (1) we encouraged humans to adopt an allocentric representation of salient objects by guiding them to perform mental rotations of these objects, and (2) we sensitized human perception with guided arrows superimposed on the imagery to improve the accuracy of depth and width estimation. We demonstrated the feasibility of our system by evaluating schematic maps generated from indoor pictures taken in an office building. By calculating Riemannian shape distances between the generated maps and the ground truth, we found that the generated schematic maps captured the spatial relations well. Our results show that combining human computation and machine clustering can lead to more accurate schematic maps from imagery. We also discuss how our approach may offer important insights for methods that leverage human computation in other areas.
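The abstract does not spell out how the Riemannian shape distance is computed; as an illustration only, a minimal sketch assuming Kendall's planar shape-space distance over corresponding landmarks (e.g., object positions on a generated map versus the ground-truth map) could look like the following. The function name and the assumption of known landmark correspondences are hypothetical, not taken from the article.

```python
import numpy as np

def riemannian_shape_distance(X, Y):
    """Kendall's Riemannian shape distance between two planar landmark
    configurations X and Y, each a (k, 2) array of corresponding points.
    Translation, scale, and rotation are factored out before comparison."""
    # Represent planar landmarks as complex numbers.
    zx = X[:, 0] + 1j * X[:, 1]
    zy = Y[:, 0] + 1j * Y[:, 1]
    # Remove translation (center) and scale (normalize) to obtain preshapes.
    zx = (zx - zx.mean()) / np.linalg.norm(zx - zx.mean())
    zy = (zy - zy.mean()) / np.linalg.norm(zy - zy.mean())
    # Rotation is optimized out by taking the modulus of the Hermitian inner
    # product; the geodesic distance on Kendall's shape space is its arccosine.
    inner = np.abs(np.vdot(zx, zy))
    return np.arccos(np.clip(inner, 0.0, 1.0))

# Hypothetical usage: smaller distances indicate generated maps whose spatial
# relations more closely match the ground-truth layout.
generated = np.array([[0.0, 0.0], [1.1, 0.1], [0.9, 2.0]])
ground_truth = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
print(riemannian_shape_distance(generated, ground_truth))
```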