Urban remote sensing, which takes cities as the objects of observation, is an important branch of remote sensing. Given the complexity of urban scenes, urban observation requires data with high temporal, spatial, and spectral resolution. To the best of our knowledge, however, no single satellite provides all of these characteristics. It is therefore necessary to coordinate data from existing remote sensing satellites to meet the needs of urban observation. In this study, we abstract the urban remote sensing observation process and propose an urban spatio-temporal-spectral observation model, addressing the lack of an existing framework for urban remote sensing. We present four applications to illustrate the proposed model: 1) a spatio-temporal fusion model for synthesizing ideal data, 2) a spatio-spectral observation model for urban vegetation biomass estimation, 3) a temporal-spectral observation model for urban flood mapping, and 4) a spatio-temporal-spectral model for impervious surface extraction. We believe that the proposed model, although still at a conceptual stage, can substantially benefit urban observation by providing a new data fusion paradigm.