Compressive sensing provides a promising sampling paradigm for video acquisition in resource‐limited sensor applications. However, reconstructing the original video signal from sub‐sampled measurements remains a major challenge. To exploit the temporal redundancy within videos during recovery, previous works tend to perform alignment on initial reconstructions, which are too coarse to provide accurate motion estimates. To solve this problem, the authors propose a novel reconstruction network, named TSRN, for compressive video sensing. Specifically, the authors utilise a number of stacked temporal shift reconstruction blocks (TSRBs) to enhance the initial reconstruction progressively. Each TSRB learns temporal structures by exchanging information with the previous and next time steps and, owing to the high efficiency of temporal shift operations, imposes no additional computation compared with regular 2D convolutions. After the enhancement, a bidirectional alignment module is employed to build accurate temporal dependencies directly with the help of optical flow. Unlike previous methods, which extract supplementary information only from the key frames, the proposed alignment module receives temporal information from the whole video sequence via bidirectional propagation, thus yielding better performance. Experimental results verify the superiority of the proposed method over other state‐of‐the‐art approaches, both quantitatively and qualitatively.
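
The temporal shift idea behind the TSRBs can be illustrated with a minimal sketch. The abstract does not specify how the shift is implemented, so the following assumes one common scheme: a fraction of the channels takes its values from the previous time step, an equal fraction from the next time step, and the remaining channels stay in place, with zeros at the sequence boundaries. The function name `temporal_shift` and the parameter `fold_div` are illustrative assumptions, not names from the paper, and each channel is represented by a single number rather than a feature map.

```python
def temporal_shift(features, fold_div=4):
    """Shift a fraction of channels along the time axis (illustrative only).

    features: list of T time steps, each a list of C per-channel values.
              Plain numbers stand in for feature maps here.
    fold_div: hypothetical parameter; C // fold_div channels are shifted
              in each temporal direction.
    """
    num_steps = len(features)
    num_channels = len(features[0])
    fold = num_channels // fold_div

    shifted = []
    for t in range(num_steps):
        step = []
        for c in range(num_channels):
            if c < fold:
                src = t - 1      # take this channel from the previous time step
            elif c < 2 * fold:
                src = t + 1      # take this channel from the next time step
            else:
                src = t          # leave the remaining channels untouched
            # Zero-fill where the source time step falls outside the sequence.
            step.append(features[src][c] if 0 <= src < num_steps else 0)
        shifted.append(step)
    return shifted


if __name__ == "__main__":
    # Three time steps, four channels each.
    frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for row in temporal_shift(frames, fold_div=4):
        print(row)
```

In this sketch the shift only moves existing values between time steps and performs no multiplications or additions, which is consistent with the claim that it adds no computation on top of the 2D convolution that would follow it.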