Many-core processors, such as graphic processing units (GPUs), are promising platforms for intrinsically parallel algorithms such as the lattice Boltzmann method (LBM). Although tremendous speedup has been obtained on a single GPU compared with mainstream CPUs, the performance of the LBM on multiple GPUs has not been studied extensively and systematically. In this article, we carry out LBM simulations on a GPU cluster with many nodes, each having multiple Fermi GPUs. Asynchronous execution with CUDA stream functions, OpenMP and non-blocking MPI communication are incorporated to improve efficiency. The algorithm is tested for two-dimensional Couette flow, and the results are in good agreement with the analytical solution. For both one- and two-dimensional decomposition of space, the algorithm performs well because most of the communication time is hidden. Direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million solid particles and one billion gas lattice cells demonstrates the potential of this algorithm in large-scale engineering applications. The algorithm can be directly extended to three-dimensional decomposition of space and to other modeling methods, including explicit grid-based methods.

Keywords: asynchronous execution, compute unified device architecture, graphic processing unit, lattice Boltzmann method, non-blocking message passing interface, OpenMP

Citation: Xiong Q G, Li B, Xu J, et al. Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units. Chin Sci Bull, 2012, 57: 707-715, doi: 10.1007/s11434-011-4908-y

High-performance computing (HPC) on general-purpose graphical processing units (GPGPUs) has emerged as a competitive approach to demanding computations such as those in computational fluid dynamics (CFD) [1,2] and discrete particle simulations [3-5].
This is, on the one hand, due to the computational capacity of graphical processing units (GPUs), which is almost one order of magnitude higher than that of mainstream central processing units (CPUs)