Emerging research has revealed that the view type of photos is relevant not only to data science, for example in analyzing the sentiment evoked by sightseeing spots, but also to social science studies of human emotions and behaviors. These potential uses of view types raise a challenging problem: automatically classifying photos as wide-view or narrow-view. In this paper, we present a computational model for this classification task inspired by the human visual system. We identify two cues that represent visual attention, namely the focus cue and the scale cue. The focus cue is modeled in the frequency domain using the nonsubsampled contourlet transform (NSCT) and speeded-up robust features (SURF). The scale cue is modeled by defining the spatial and conceptual sizes of an object in the image, where AdobeBING and a convolutional neural network are used for the respective measurements. By integrating the focus and scale models, we propose a robust scheme for this non-trivial task. Experiments on a newly established dataset of 5050 natural images show that our proposal outperforms state-of-the-art methods.