Background: Fast dose calculation is critical for online and real-time adaptive therapy workflows. While modern physics-based dose algorithms must compromise accuracy to achieve low computation times, deep learning models can potentially perform dose prediction tasks with both high fidelity and speed. Purpose: We present a deep learning algorithm that, exploiting synergies between Transformer and convolutional layers, accurately predicts broad photon beam dose distributions in a few milliseconds. Methods: The proposed improved Dose Transformer Algorithm (iDoTA) maps arbitrary patient geometries and beam information (in the form of a 3D projected shape resulting from a simple ray tracing calculation) to their corresponding 3D dose distribution. Treating the 3D CT input and dose output volumes as a sequence of 2D slices along the direction of the photon beam, iDoTA solves the dose prediction task as a sequence modeling problem. The proposed model combines a Transformer backbone, which routes long-range information between all elements in the sequence, with a series of 3D convolutions that extract local features of the data. We train iDoTA on a dataset of 1700 beam dose distributions and assess its accuracy and speed using 11 clinical volumetric modulated arc therapy (VMAT) plans (from prostate, lung, and head and neck cancer patients, with 194-354 beams per plan). Results: iDoTA predicts individual photon beams in ≈ 50 ms with a high gamma pass rate of 97.72 ± 1.93% (2 mm, 2%). Furthermore, estimating full VMAT dose distributions in 6-12 s, iDoTA achieves state-of-the-art performance with a 99.51 ± 0.66% (2 mm, 2%) pass rate and an average relative dose error of 0.75 ± 0.36%. Conclusions: Offering the millisecond-scale prediction speed per beam angle needed in online and real-time adaptive treatments, iDoTA represents a new state of the art in data-driven photon dose calculation.
The proposed model can massively speed up current photon workflows, reducing calculation times from a few minutes to just a few seconds.
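
To make the sequence-modeling formulation in the Methods concrete, the following is a minimal NumPy sketch, not the iDoTA implementation. It assumes that axis 0 of each volume is the beam direction, that the CT and the ray-traced beam shape share one grid, and that each slice is simply flattened into a token (in the actual model, convolutional layers would extract the slice features). The function names, the grid size, the projection width, and the random weights are all hypothetical, chosen only for illustration. A single unmasked self-attention head then shows how information can be routed between all slices in the sequence.

```python
import numpy as np


def volumes_to_slice_sequence(ct, beam_shape):
    """Turn two 3D volumes into a sequence of per-slice tokens.

    Both inputs have shape (depth, height, width); axis 0 is ASSUMED to be
    the beam direction. Returns an array of shape (depth, 2 * height * width).
    """
    if ct.shape != beam_shape.shape:
        raise ValueError("ct and beam_shape must share one grid")
    stacked = np.stack([ct, beam_shape], axis=1)  # (depth, 2, height, width)
    # Flattening is a stand-in for a learned convolutional slice encoder.
    return stacked.reshape(stacked.shape[0], -1)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head, unmasked scaled dot-product self-attention."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (depth, depth)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


# Hypothetical toy data: 8 slices of 4 x 4 voxels along the beam.
rng = np.random.default_rng(0)
ct = rng.normal(size=(8, 4, 4))
beam_shape = rng.uniform(size=(8, 4, 4))

tokens = volumes_to_slice_sequence(ct, beam_shape)  # one token per slice
d_in, d_model = tokens.shape[1], 16  # d_model is an arbitrary choice
w_q, w_k, w_v = (rng.normal(size=(d_in, d_model)) * 0.1 for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, out.shape, weights.shape)
```

Each row of `weights` describes how strongly one slice attends to every other slice along the beam, which is the long-range routing that complements the local features extracted by 3D convolutions.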