Purpose: VMAT optimization is a computationally challenging problem due to its large data size, high degrees of freedom, and many hardware constraints. High-performance graphics processing units (GPUs) have been used to speed up the computations. However, the relatively small memory of a GPU cannot hold a large dose-deposition coefficient (DDC) matrix, which arises in cases with, e.g., a large target, multiple targets, multiple arcs, and/or a small beamlet size. The main purpose of this paper is to report an implementation of a column-generation-based VMAT algorithm, previously developed in our group, on a multi-GPU platform to solve the memory limitation problem. Although the column-generation-based VMAT algorithm has been previously developed, the details of its GPU implementation have not been reported. Hence, another purpose is to present the detailed techniques employed for the GPU implementation. We also use this problem as an example to study the feasibility of using a multi-GPU platform to solve large-scale problems in medical physics. Methods: The column-generation approach generates VMAT apertures sequentially by iteratively solving a pricing problem (PP) and a master problem (MP). In our method, the sparse DDC matrix is first stored on the CPU in coordinate list (COO) format. On the GPU side, this matrix is split into four sub-matrices according to beam angles, which are stored on four GPUs in compressed sparse row (CSR) format. Computation of the beamlet price, the first step in the PP, is accomplished using multiple GPUs. A fast inter-GPU data transfer scheme is designed using peer-to-peer (P2P) access. The remaining steps of the PP and the MP are implemented on the CPU or a single GPU because of their modest problem scale and computational load. The Barzilai-Borwein (BB) algorithm with a subspace step scheme is adopted to solve the MP. A head and neck (H&N) cancer case was used to validate our method.
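As an illustration of the partitioning step described above, the following is a minimal CPU-only Python sketch, not the implementation reported here. It assumes that each sub-matrix is stored with one CSR row per beamlet, that beam angles are assigned to devices in contiguous blocks, and that the beamlet price is the inner product of a beamlet's DDC entries with the gradient of the objective with respect to voxel dose. The function names `split_coo_by_angle` and `beamlet_prices` and the data layout are hypothetical; each loop iteration over `parts` stands in for the work one GPU would do.

```python
def split_coo_by_angle(coo, beamlet_angle, n_parts):
    """Split a COO matrix into per-device CSR blocks grouped by beam angle.

    coo           : list of (voxel, beamlet, value) triplets (assumed layout)
    beamlet_angle : beamlet_angle[b] is the beam-angle index of beamlet b
    n_parts       : number of devices (four in the text)
    """
    n_angles = max(beamlet_angle) + 1
    # Assumption: angles are assigned to devices in contiguous blocks.
    part_of_angle = [a * n_parts // n_angles for a in range(n_angles)]
    parts = []
    for p in range(n_parts):
        beamlets = [b for b, a in enumerate(beamlet_angle)
                    if part_of_angle[a] == p]
        local = {b: i for i, b in enumerate(beamlets)}
        rows = [[] for _ in beamlets]
        for voxel, b, val in coo:
            if b in local:
                rows[local[b]].append((voxel, val))
        # Build the CSR arrays: row pointers, column indices, values.
        indptr, indices, data = [0], [], []
        for row in rows:
            for voxel, val in sorted(row):
                indices.append(voxel)
                data.append(val)
            indptr.append(len(indices))
        parts.append({"beamlets": beamlets, "indptr": indptr,
                      "indices": indices, "data": data})
    return parts


def beamlet_prices(parts, grad):
    """Compute one price per beamlet as a CSR row times the dose gradient."""
    prices = {}
    for part in parts:  # each iteration stands in for one GPU
        for i, b in enumerate(part["beamlets"]):
            start, stop = part["indptr"][i], part["indptr"][i + 1]
            prices[b] = sum(part["data"][k] * grad[part["indices"][k]]
                            for k in range(start, stop))
    return prices
```

Because every beamlet belongs to exactly one block in this layout, the per-device results are disjoint and can simply be gathered, which is one reason a partition by beam angle is convenient for the pricing step.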
We also compare our multi-GPU implementation with three single-GPU implementation strategies, namely truncating the DDC matrix (S1), repeatedly transferring the DDC matrix between the CPU and GPU (S2), and porting computations involving the DDC matrix to the CPU (S3), in terms of both plan quality and computational efficiency. Two more H&N patient cases and three prostate cases were also used to demonstrate the advantages of our method. Results: Our multi-GPU implementation can finish the optimization process within ~1 minute for the H&N patient case. S1 leads to inferior plan quality, although its total time was 10 seconds shorter than that of the multi-GPU implementation because of the reduced matrix size. S2 and S3 yield the same plan quality as the multi-GPU implementation but take ~4 minutes and ~6 minutes, respectively. High computational efficiency was consistently achieved for the other 5 patient cases tested, with VMAT plans of clinically acceptable quality obtained within 23 to 46 seconds. Conversely, to obtain clinically comparable or acceptable plans for all these 6 VMAT cases that we have tested in t...