Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique in military applications such as remote-sensing image recognition. Vision Transformers (ViTs) achieve state-of-the-art performance in many computer vision applications, outperforming Convolutional Neural Networks (CNNs). However, applying ViTs to SAR ATR is challenging because (1) standard ViTs require extensive training data to generalize well due to their weak locality inductive bias, while standard SAR datasets provide only a limited amount of labeled training data, which restricts the learning capability of ViTs; and (2) ViTs have high parameter counts and are computation-intensive, which makes their deployment on resource-constrained SAR platforms difficult. In this work, we develop a lightweight ViT model that can be trained directly on small datasets without pre-training. To this end, we incorporate the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules into the ViT model and train it directly on SAR datasets to evaluate its effectiveness for SAR ATR. The proposed model, VTR (ViT for SAR ATR), is evaluated on three widely used SAR datasets: MSTAR, SynthWakeSAR, and GBSAR. Experimental results show that VTR achieves classification accuracies of 95.96%, 93.47%, and 99.46% on the MSTAR, SynthWakeSAR, and GBSAR datasets, respectively. VTR achieves accuracy comparable to state-of-the-art models on the MSTAR and GBSAR datasets with 1.1× and 36× smaller model sizes, respectively, and achieves higher accuracy on the SynthWakeSAR dataset with a 17× smaller model. Further, a novel FPGA accelerator is proposed for VTR to enable real-time SAR ATR. Compared with VTR implementations on state-of-the-art CPU and GPU platforms, our FPGA implementation reduces latency by 70× and 30×, respectively. For inference on small batch sizes, our FPGA implementation achieves 2× higher throughput than the GPU.
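For context, the sketch below illustrates the two modules the abstract names, SPT and LSA, following their standard formulation for ViTs on small datasets (Lee et al.). It is a minimal PyTorch rendering, not the authors' implementation: the class names, patch size, embedding dimension, and head count are illustrative assumptions.

```python
# Minimal sketch of SPT and LSA (illustrative; not the authors' VTR code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenization(nn.Module):
    """SPT: concatenate the image with four half-patch diagonal shifts,
    then flatten non-overlapping patches and linearly project them.
    The extra shifted views widen each token's receptive field,
    injecting locality that plain patch embedding lacks."""
    def __init__(self, in_ch=1, patch=8, dim=64):  # assumed hyperparameters
        super().__init__()
        self.patch = patch
        self.proj = nn.Sequential(
            nn.LayerNorm(5 * in_ch * patch * patch),
            nn.Linear(5 * in_ch * patch * patch, dim),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        s = self.patch // 2
        shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]
        views = [torch.roll(x, sh, dims=(2, 3)) for sh in shifts]
        x = torch.cat([x] + views, dim=1)           # (B, 5C, H, W)
        x = F.unfold(x, self.patch, stride=self.patch)  # (B, 5C*p*p, N)
        return self.proj(x.transpose(1, 2))         # (B, N, dim)

class LocalitySelfAttention(nn.Module):
    """LSA: self-attention with a learnable temperature and the diagonal
    (token-to-self) logits masked out, sharpening the attention
    distribution toward inter-token relations."""
    def __init__(self, dim=64, heads=4):            # assumed hyperparameters
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # Learnable temperature, initialized to the usual 1/sqrt(d_h) scale.
        self.temperature = nn.Parameter(torch.tensor(self.dh ** -0.5))

    def forward(self, x):                           # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.heads, self.dh).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float('-inf'))  # drop self-relations
        attn = attn.softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(y)
```

In a lightweight ViT of this kind, SPT would replace the standard patch-embedding layer and LSA would replace the scaled dot-product attention inside each Transformer block; both changes add negligible parameters, consistent with the small model sizes reported above.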