Recently, the transformer architecture has achieved great success in the computer vision community, e.g., in image classification and object detection. Nonetheless, its application to 3D vision remains largely unexplored, since point clouds are inherently sparse, irregular, and unordered. Furthermore, existing point transformer frameworks usually feed a raw N×3 point cloud into the transformer, which limits the point processing scale because of the quadratic computational cost with respect to the input size N. In this paper, we rethink the structure of the point transformer. Instead of applying the transformer directly to points, our network (TransLO) can process tens of thousands of points simultaneously by projecting them onto a 2D surface and then feeding them into a local transformer with linear complexity. Specifically, it is composed of two main components: a Window-based Masked Transformer with Self-Attention (WMSA) to capture long-range dependencies, and a Masked Cross-Frame Attention (MCFA) module to associate two frames and estimate the pose. To deal with the sparsity of point clouds, we propose a binary mask to remove invalid and dynamic points. To our knowledge, this is the first transformer-based LiDAR odometry network. Experimental results on the KITTI odometry dataset show that our method achieves an average rotation RMSE of 0.500°/100m and an average translation RMSE of 0.993%. It surpasses all recent learning-based methods and even outperforms LOAM on most evaluation sequences. Code will be released at https://github.com/IRMVLab/TransLO.
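To make the window-based masked attention concrete, below is a minimal PyTorch sketch of self-attention computed within non-overlapping windows of a projected range image, with a binary validity mask suppressing attention to empty or dynamic pixels. The class name `WindowMaskedSelfAttention` and all hyperparameters are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class WindowMaskedSelfAttention(nn.Module):
    """Self-attention within non-overlapping w x w windows of a projected
    range image, ignoring invalid pixels via a binary mask. For a fixed
    window size, cost is linear in the number of pixels."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window, self.heads = window, heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, mask):
        # x:    (B, H, W, C) features from the 2D projection of the point cloud
        # mask: (B, H, W) binary, 1 = valid point, 0 = empty/dynamic pixel
        B, H, W, C = x.shape
        w, h = self.window, self.heads
        # Partition the feature map into (B * num_windows, w*w, C) tokens.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        m = mask.view(B, H // w, w, W // w, w).permute(0, 1, 3, 2, 4)
        m = m.reshape(-1, w * w)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(-1, w * w, h, C // h).transpose(1, 2)  # (Bn, h, w*w, d)
        k = k.view(-1, w * w, h, C // h).transpose(1, 2)
        v = v.view(-1, w * w, h, C // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (Bn, h, w*w, w*w)
        # Suppress attention *to* invalid pixels (large negative instead of
        # -inf so a fully invalid window does not produce NaNs in softmax).
        attn = attn.masked_fill(m[:, None, None, :] == 0, -1e9)
        out = attn.softmax(dim=-1) @ v                     # (Bn, h, w*w, d)
        out = self.proj(out.transpose(1, 2).reshape(-1, w * w, C))
        out = out * m.unsqueeze(-1)                        # zero invalid queries
        # Restore the (B, H, W, C) layout.
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

# Example (hypothetical shapes): a 64x128 range image with 64 channels.
attn = WindowMaskedSelfAttention(dim=64, window=8, heads=4)
x = torch.randn(2, 64, 128, 64)
m = (torch.rand(2, 64, 128) > 0.3).float()
y = attn(x, m)  # (2, 64, 128, 64)
```

The point of the windowing is the complexity claim in the abstract: attention is computed over (H·W)/w² windows of w² tokens each, so the total cost scales as O(H·W·w²) rather than O(N²) for N raw points, which is what allows tens of thousands of points per pass.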