This work presents cost-effective multi-graphics processing unit (GPU) parallel implementations of a finite-volume numerical scheme for solving pollutant transport problems in bidimensional domains. The fluid is modeled by the 2D shallow-water equations, whereas the transport of the pollutant is modeled by a transport equation. The 2D domain is discretized using a first-order Roe finite-volume scheme. Specifically, this paper presents multi-GPU implementations of both a solution that exploits recomputation on the GPU and an optimized solution that is based on a ghost cell decoupling approach. Our multi-GPU implementations have been optimized using nonblocking communications, the overlapping of communications and computations, and the application of ghost cell expansion to minimize communications. The fastest one reached a speedup of 78 using four GPUs on an InfiniBand network with respect to a parallel execution on a multicore CPU with six cores and two-way hyperthreading per core. Such performance, measured using a realistic problem, enabled the calculation of solutions not only in real time but also orders of magnitude faster than the simulated time.

The ghost cell decoupling technique makes each thread block self-sufficient (CUDA offers no efficient mechanism for synchronization between thread blocks). For this purpose, the 2D domain is split into 2D subdomains that include several ghost cells. The ghost cells represent flux contributions that are recomputed in two neighboring thread blocks. In our shallow-water problem, these ghost cells are a row and a column of each 2D subdomain. Thus, this memory region (the ghost region from now on) is read by two thread blocks, although it is only updated by one. The reason is that the ghost region, together with the rows and columns whose update is the responsibility of the thread block, provides all the information the thread block needs to perform its computations, therefore making the block self-sufficient. Overall, the ghost cell decoupling technique removes the replicated computations for most of the cells, the exception being the ghost cells of each thread block.

The algorithm shown in Figure 4 details the implementation of the ghost cell decoupling technique. The thread responsible for volume v computes the flow from the neighbor volumes on the right (flR in line 8) and bottom (flD in line 9). Next, a partial flow ΔM[v] is calculated as flR[v] + flD[v] (line 10). The same procedure is followed to obtain the partial Δt[v] (lines 12-14). These partial values, flR[v], flD[v], dtR[v] and dtD[v], are stored in shared memory, so that in the second phase (lines 19-22) the thread responsible for volume v only has to add to its partial flow the opposite contribution of its left and up neighbors (line 20; correspondingly, Δt[v] in line 21). A synchronization barrier is needed between the first and second phases (line 17) because in the second phase each thread reads the partial flows (lines 20-21) stored in shared memory by another thread in the first phase (lines 8-14). In CUDA, a synchronization barrier (__syncthreads()) stops all warps within a given thread block until all the warps have reached the barrier.
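As an illustration of this two-phase structure, the sketch below follows the pattern described for Figure 4, but it is not the paper's kernel: the flux and time-step expressions (fluxBetween, dtBetween), the 16x16 block size, the choice of the first row/column as the ghost region, and the exact overlap indexing are assumptions made for the example; only the phase 1 / barrier / phase 2 organization and the shared-memory partials reflect the text.

    // Minimal sketch of the two-phase ghost cell decoupling kernel.
    // fluxBetween() and dtBetween() are hypothetical stand-ins for the
    // Roe edge-flux and local time-step computations of the scheme.
    __device__ float fluxBetween(float a, float b) { return 0.5f * (a - b); }
    __device__ float dtBetween(float a, float b) { return fabsf(a) + fabsf(b); }

    #define BDIMX 16
    #define BDIMY 16

    __global__ void ghostDecoupledStep(const float *U, float *dM, float *dt,
                                       int nx, int ny)
    {
        // Partial edge contributions kept in shared memory for phase 2.
        __shared__ float flR[BDIMY][BDIMX], flD[BDIMY][BDIMX];
        __shared__ float dtR[BDIMY][BDIMX], dtD[BDIMY][BDIMX];

        int tx = threadIdx.x, ty = threadIdx.y;
        // Blocks overlap by one row and one column: the first row/column
        // of each block is its ghost region, whose cells are recomputed
        // here but finally updated by the neighboring block.
        int gx = blockIdx.x * (BDIMX - 1) + tx;
        int gy = blockIdx.y * (BDIMY - 1) + ty;
        bool inside = (gx < nx - 1) && (gy < ny - 1);
        int v = gy * nx + gx;

        float m = 0.0f, t = 0.0f;
        if (inside) {
            // Phase 1: every thread, ghost or not, computes the fluxes
            // through its right and bottom edges and a partial sum.
            flR[ty][tx] = fluxBetween(U[v], U[v + 1]);
            flD[ty][tx] = fluxBetween(U[v], U[v + nx]);
            m = flR[ty][tx] + flD[ty][tx];            // partial ΔM[v]
            dtR[ty][tx] = dtBetween(U[v], U[v + 1]);
            dtD[ty][tx] = dtBetween(U[v], U[v + nx]);
            t = dtR[ty][tx] + dtD[ty][tx];            // partial Δt[v]
        }

        // Phase 2 reads partials written by neighbor threads, so every
        // thread of the block must reach this barrier first.
        __syncthreads();

        if (inside && tx > 0 && ty > 0) {
            // Phase 2: non-ghost threads add the opposite contribution
            // of their left and upper neighbors and write the results.
            m += -flR[ty][tx - 1] - flD[ty - 1][tx];
            t += dtR[ty][tx - 1] + dtD[ty - 1][tx];
            dM[v] = m;
            dt[v] = t;
        }
    }

The sign flip in phase 2 reflects that the flow a volume sends to its right (or bottom) neighbor enters that neighbor with opposite sign, so each edge flux is computed only once per block; the Δt contributions are assumed here to accumulate positively, as is usual for wave-speed-based time-step estimates.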
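At the multi-GPU level, the nonblocking communications and communication/computation overlap mentioned above can be sketched as follows. This is an illustrative host-side routine assuming one MPI process per GPU and a 1D row-wise domain decomposition, which is not necessarily the paper's exact scheme; the MPI and CUDA runtime calls are standard, while interiorKernel and borderKernel are hypothetical kernel names.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* d_U holds (rows + 2) rows of nx cells on the GPU: row 0 and row
     * rows+1 are ghost rows received from the ranks up and down (either
     * may be MPI_PROC_NULL at the domain boundary, which MPI treats as
     * a no-op). */
    void exchangeAndStep(float *d_U, float *h_send, float *h_recv,
                         int nx, int rows, int up, int down, MPI_Comm comm)
    {
        size_t rowBytes = nx * sizeof(float);
        MPI_Request req[4];

        /* 1. Stage this rank's first and last owned rows on the host and
              post nonblocking sends/receives for both ghost rows. */
        cudaMemcpy(h_send,      d_U + 1 * nx,    rowBytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(h_send + nx, d_U + rows * nx, rowBytes, cudaMemcpyDeviceToHost);
        MPI_Isend(h_send,      nx, MPI_FLOAT, up,   0, comm, &req[0]);
        MPI_Isend(h_send + nx, nx, MPI_FLOAT, down, 1, comm, &req[1]);
        MPI_Irecv(h_recv,      nx, MPI_FLOAT, up,   1, comm, &req[2]);
        MPI_Irecv(h_recv + nx, nx, MPI_FLOAT, down, 0, comm, &req[3]);

        /* 2. Overlap: update the interior cells, which do not read the
              ghost rows, while the messages are in flight. */
        /* interiorKernel<<<grid, block>>>(d_U, nx, rows);  // hypothetical */

        /* 3. Complete the exchange, upload the ghost rows, and finish
              the border rows that depend on them. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        cudaMemcpy(d_U,                   h_recv,      rowBytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_U + (rows + 1) * nx, h_recv + nx, rowBytes, cudaMemcpyHostToDevice);
        /* borderKernel<<<grid2, block>>>(d_U, nx, rows);   // hypothetical */
    }

Ghost cell expansion would widen the exchanged halo from one row to several, so that an exchange of this kind is needed only every few time steps, at the cost of some recomputation in the expanded ghost rows.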