Single-pixel imaging (SPI) based on deep learning networks, e.g., convolutional neural networks (CNNs) and vision transformers (ViTs), has made significant progress. However, existing models, especially those based on ViT architectures, have large parameter counts and heavy computational loads, which makes them unsuitable for mobile SPI applications. To overcome this limitation, we propose mobile ViT blocks that reduce the computational cost of traditional ViTs, and combine them with CNNs to design what we believe to be a novel lightweight CNN-ViT hybrid model for efficient and accurate SPI reconstruction. We also propose a general-purpose differential ternary modulation pattern scheme for deep learning SPI (DLSPI) that is both training-friendly and hardware-friendly. Simulations and real experiments demonstrate that our method achieves higher imaging quality with lower memory consumption and computational cost than state-of-the-art DLSPI methods.
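
The abstract does not spell out the differential ternary modulation scheme, so the following is only a generic sketch of the idea as it is commonly understood, not the authors' implementation. It assumes patterns with values in {-1, 0, +1}, which a binary modulator cannot display directly; each pattern is therefore split into two binary patterns and the two single-pixel measurements are subtracted. The function name `differential_measure`, the `background` parameter, and the random patterns and scene are all hypothetical and chosen for illustration.

```python
import numpy as np

def differential_measure(scene, patterns, background=0.0):
    """Simulate differential single-pixel measurements for ternary patterns.

    scene:      (H, W) non-negative image (hypothetical test object).
    patterns:   (M, H, W) array with entries in {-1, 0, +1}.
    background: constant offset added to every raw measurement
                (assumed model of ambient light / detector bias).
    """
    m = patterns.shape[0]
    # Split each ternary pattern into two binary {0, 1} patterns.
    pos = (patterns > 0).astype(np.float64).reshape(m, -1)
    neg = (patterns < 0).astype(np.float64).reshape(m, -1)
    flat = scene.reshape(-1).astype(np.float64)
    # Two physical measurements per pattern, each with the same offset.
    y_pos = pos @ flat + background
    y_neg = neg @ flat + background
    # The subtraction yields the ternary measurement and cancels the offset.
    return y_pos - y_neg

rng = np.random.default_rng(0)
patterns = rng.integers(-1, 2, size=(64, 16, 16))  # random ternary patterns
scene = rng.random((16, 16))

y = differential_measure(scene, patterns, background=5.0)
# Reference: what an ideal ternary modulator would measure without background.
direct = patterns.reshape(64, -1).astype(np.float64) @ scene.reshape(-1)
```

Under these assumptions, `y` matches `direct` up to floating-point error, which illustrates why a differential scheme can be considered hardware-friendly: only binary patterns are displayed, and any constant background term drops out of the difference.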