Visual tracking is one of the most fundamental topics in computer vision. Numerous tracking approaches based on discriminative correlation filters or Siamese convolutional networks have attained remarkable performance over the past decade. However, developing robust and effective trackers that achieve satisfactory performance with high computational and memory efficiency in real-world scenarios is still widely recognized as an open research problem. In this paper, we investigate the impact of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. First, a new Siamese lightweight hourglass network is specially designed for visual tracking. It exploits repeated bottom-up and top-down inference to capture global and local contextual information at multiple scales. Second, a novel cross-attentional module is introduced to leverage both channel-wise and spatial intermediate attentional information, which enhances both the discriminative and the localization capabilities of the feature maps. Third, a keypoint detection approach is designed to localize the target object by detecting the top-left corner, the centroid, and the bottom-right corner of its bounding box. As a result, our SATIN tracker not only learns more effective object representations, but is also computationally and memory efficient during both the training and testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.
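To illustrate the keypoint-based detection idea, the following is a minimal sketch of decoding a bounding box from three heatmaps (top-left corner, centroid, bottom-right corner). The `decode_keypoints` helper is hypothetical and assumes single-channel heatmaps with one peak each; the actual SATIN decoding pipeline (e.g., any offset regression or grouping logic) is not reproduced here.

```python
import numpy as np

def decode_keypoints(tl_heatmap, ct_heatmap, br_heatmap):
    """Hypothetical sketch: decode one bounding box from three keypoint
    heatmaps (top-left corner, centroid, bottom-right corner).
    Each heatmap is assumed to be a 2D array with a single peak."""
    # Take the argmax of each heatmap as the keypoint location (y, x).
    tl = np.unravel_index(np.argmax(tl_heatmap), tl_heatmap.shape)
    ct = np.unravel_index(np.argmax(ct_heatmap), ct_heatmap.shape)
    br = np.unravel_index(np.argmax(br_heatmap), br_heatmap.shape)
    # The centroid peak acts as a consistency check: it should lie close
    # to the geometric center of the two corner detections.
    cy, cx = (tl[0] + br[0]) / 2.0, (tl[1] + br[1]) / 2.0
    consistent = abs(ct[0] - cy) <= 2 and abs(ct[1] - cx) <= 2
    # Bounding box as (x1, y1, x2, y2).
    box = (tl[1], tl[0], br[1], br[0])
    return box, consistent
```

Using the centroid as a third keypoint gives a cheap validation signal: a corner pair whose midpoint disagrees with the centroid peak can be rejected as a spurious detection.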
Visual tracking approaches with real-time speed are widely pursued in numerous applications, including video surveillance, autonomous driving, pose estimation, and human-computer interfaces. Although considerable progress has been made by many tracking approaches in the last decade, visual tracking is still a challenging problem due to multiple negative scenarios such as occlusions, fast motions, scale variations, and background clutters [1,2,3,4].

Most recent tracking approaches are developed based on two successful frameworks. The first one is the discriminative correlation filter (DCF). Thanks to the circular shifting of training samples and the fast learning of correlation filters in the Fourier frequency domain, DCF demonstrates superior computational efficiency and fairly good tracking accuracy, and has attracted increasing attention since MOSSE [5] first exploited it. Later, in order to overcome different kinds of challenges and achieve competitive tracking performance, the advancement in DCF-based tracking focused on the use of kernel functions [6], motion information [7,8], multi-dimensional features [9,10], multi-scale estimation [11,12], boundary effects alleviation ...
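The Fourier-domain learning that makes DCF trackers fast can be sketched as follows. This is a minimal, single-channel MOSSE-style filter: the closed-form solution H* = Σ G⊙F* / (Σ F⊙F* + λ) is computed once over the training patches, and tracking reduces to one FFT, one element-wise product, and one inverse FFT per frame. Function names and the regularizer λ are illustrative, not the paper's implementation.

```python
import numpy as np

def train_mosse_filter(patches, target, lam=1e-2):
    """Learn a MOSSE-style correlation filter in the Fourier domain.
    patches: list of HxW grayscale training patches centered on the target.
    target:  HxW desired response (typically a Gaussian peaked at the center).
    Returns the conjugate filter H*, ready to be applied by multiplication."""
    G = np.fft.fft2(target)
    num = np.zeros_like(G)
    den = np.zeros_like(G)
    for p in patches:
        F = np.fft.fft2(p)
        num += G * np.conj(F)      # numerator:   sum of G . F*
        den += F * np.conj(F)      # denominator: sum of F . F*
    return num / (den + lam)       # lam regularizes near-zero frequencies

def correlate(H_conj, patch):
    """Apply the learned filter to a search patch; the location of the
    response peak gives the estimated target translation."""
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * H_conj))
```

Because circular shifts of a patch only change the phase of its Fourier transform, the response to a shifted patch is (approximately) the shifted target response, which is why a single FFT round-trip suffices to locate the object.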