Synthetic aperture radar (SAR) tomography (TomoSAR) can reconstruct 3D models of observed urban areas and can separate multiple scatterers within a single azimuth–range pixel. Recently, compressive sensing (CS) has been applied to TomoSAR imaging using the very-high-resolution (VHR) SAR images delivered by modern SAR systems such as TerraSAR-X and TanDEM-X. Compared with traditional Fourier transform and spectral estimation methods, TomoSAR imaging that exploits sparsity offers super-resolution capability and robustness, and is only slightly affected by sidelobes. However, because SAR satellite orbits are tightly controlled, the number of acquisitions is usually too low to form an adequate synthetic aperture in the elevation direction, and the baseline distribution of the acquisitions is uneven. In addition, artificial outliers are easily generated in later TomoSAR processing, which degrades the mapping product. To address these problems, this paper first briefly reviews the research status of sparse TomoSAR imaging, drawing on the existing literature. It then proposes a joint sparse imaging algorithm, based on building points of interest (POIs) and maximum likelihood estimation, to reduce the number of acquisitions required and to reject scatterer outliers. The proposed workflow is applied to TerraSAR-X datasets acquired in staring spotlight (ST) mode. Experiments on simulated data and TerraSAR-X data stacks indicate the effectiveness of the proposed approach and demonstrate the potential of ST data for producing high-precision dense point clouds.