Convolutional neural networks (CNNs) have demonstrated excellent performance in salient object detection for optical remote sensing images (ORSI-SOD). However, the sliding-window nature of CNN feature extraction limits their ability to capture global representations. We therefore propose an end-to-end detection model for ORSI-SOD, the adaptive dual-stream sparse transformer network (ADSTNet), which is assisted by the vision transformer and effectively balances global and local information in ORSI-SOD. In particular, we devise an adaptive interaction encoder that combines the multi-scale sparse transformer (MST) and pyramid atrous attention (PAA) to form the adaptive dual-stream sparse encoder (ADSE). This encoder collaborates with the CNN to enhance long-range dependency modeling and to preserve global information more effectively based on local features. Additionally, a directional feature reconfiguration (DFR) module is constructed to extract texture details along multiple directional dimensions. Finally, we propose the adaptive feature cascade decoder (AFCD), which synthesizes content information from the foreground, edges, and background to enhance the representational capacity of the image. Furthermore, a structural loss function, the weight compensation mechanism, is introduced to balance the boundary and saliency-map segmentation losses. Extensive experiments demonstrate that the proposed model outperforms 26 state-of-the-art (SOTA) ORSI-SOD methods across eight evaluation metrics on two standard datasets. To further verify its robustness, we also report the model's generalization performance on the recent and challenging ORSI-4199 dataset. The code and results for this work can be found at https://github.com/JieZzzoo/ADSTNet.
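For illustration only: the abstract does not specify the exact form of the weight compensation mechanism, so the minimal PyTorch sketch below shows one plausible way to balance a saliency-map loss against a boundary loss with a compensation weight. The function names (`weight_compensation_loss`, `iou_loss`), the parameter `lambda_b`, and the choice of BCE + IoU terms are assumptions for exposition, not the paper's actual formulation.

```python
# Hypothetical sketch of a weight-compensated structural loss: a saliency term
# (BCE + soft IoU) plus a boundary BCE term scaled by a compensation weight.
# This is NOT the paper's exact loss; it only illustrates the balancing idea.
import torch
import torch.nn.functional as F


def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss on predicted saliency probabilities in [0, 1]."""
    inter = (pred * target).sum(dim=(2, 3))
    union = (pred + target - pred * target).sum(dim=(2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()


def weight_compensation_loss(sal_logits, sal_gt, edge_logits, edge_gt, lambda_b=1.0):
    """Combine saliency-map and boundary losses; lambda_b (assumed name)
    compensates for the imbalance between thin edges and full regions."""
    sal_prob = torch.sigmoid(sal_logits)
    l_sal = F.binary_cross_entropy_with_logits(sal_logits, sal_gt) + iou_loss(sal_prob, sal_gt)
    l_edge = F.binary_cross_entropy_with_logits(edge_logits, edge_gt)
    return l_sal + lambda_b * l_edge


if __name__ == "__main__":
    # Toy batch: 2 images, 1 channel, 64x64 predictions and ground truths.
    sal_logits = torch.randn(2, 1, 64, 64)
    edge_logits = torch.randn(2, 1, 64, 64)
    sal_gt = (torch.rand(2, 1, 64, 64) > 0.5).float()
    edge_gt = (torch.rand(2, 1, 64, 64) > 0.9).float()  # edges are sparse
    print(weight_compensation_loss(sal_logits, sal_gt, edge_logits, edge_gt).item())
```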