UPCommons: Open knowledge portal of the UPC
http://upcommons.upc.edu/e-prints

Abstract: Volume-of-Fluid (VOF) is one of the methods of choice for reproducing the interface motion in simulations of multi-fluid flows. One of its main strengths is its accuracy in capturing sharp interface geometries, although this requires a number of geometric calculations. Under these circumstances, achieving parallel performance on current supercomputers is a must. The main obstacle to parallelization is that the computing costs are concentrated only in the discrete elements that lie on the interface between fluids. Consequently, if the interface is not homogeneously distributed throughout the domain, standard domain decomposition (DD) strategies lead to imbalanced workload distributions. In this paper, we present a new parallelization strategy for general unstructured VOF solvers, based on a dynamic load balancing process complementary to the underlying DD. Its parallel efficiency has been analyzed and compared to that of the DD using up to 1024 CPU-cores on an Intel SandyBridge-based supercomputer. The results obtained for several artificially generated test cases show a speedup of up to 12x with respect to the standard DD, depending on the interface size, the initial distribution and the number of parallel processes engaged. Moreover, the new parallelization strategy is general purpose; therefore, it could be used to parallelize any VOF solver without requiring changes to the coupled flow solver. Finally, note that although designed for the VOF method, our approach could be easily adapted to other interface-capturing methods, such as Level-Set, which may present similar workload imbalances.
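The workload imbalance described in the abstract can be illustrated with a minimal sketch. It is not the load balancing algorithm of the paper: the per-process interface-cell counts are hypothetical, the function names `imbalance` and `balanced` are ours, and communication costs are ignored. It only shows why a DD in which the interface falls on a few processes leaves most of them idle during the VOF step.

```python
def imbalance(work):
    """Ratio of the most loaded process to the average load (1.0 = perfectly balanced)."""
    mean = sum(work) / len(work)
    return max(work) / mean


def balanced(work):
    """Spread the total work as evenly as possible over the same number of processes."""
    total, p = sum(work), len(work)
    base, extra = divmod(total, p)
    # The first `extra` processes take one additional cell each.
    return [base + (1 if i < extra else 0) for i in range(p)]


# Hypothetical interface-cell counts per process under a plain DD:
# the interface lies in only two of the eight subdomains.
dd_work = [0, 0, 120, 40, 0, 0, 0, 0]
lb_work = balanced(dd_work)

# Upper bound on the gain of the VOF step, assuming cost proportional
# to interface cells and zero redistribution overhead (an idealization).
ideal_gain = max(dd_work) / max(lb_work)
print(imbalance(dd_work), imbalance(lb_work), ideal_gain)
```

Under these assumptions the step time is set by the most loaded process, so the attainable gain is bounded by the ratio of the maximum loads before and after balancing; any real strategy pays additional communication costs for moving work between processes.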
SUMMARY OF CHANGES

In accordance with the comments raised by the reviewer, the changes to the manuscript are:

1. As suggested by Reviewer #1: (1) Eq. 3 is valid only for divergence-free velocity fields; hence, this condition has been introduced in the text; (2) the limits of the sum in the numerator of Eqs. 9-11 were incorrect; therefore, they have been corrected to start at 0 and end at P-1; (3) the description of Fig. 10 has been improved.

2. The reviewer observed that Alg. 4 did not scale in terms of memory, since vector WI gathered information from all processes and thus grew with the size of the problem and the number of CPU-cores. This could have become a bottleneck in larger-scale tests. Therefore, we have developed a new version of the algorithm that overcomes this limitation. The new version is based on non-blocking point-to-point communications. The new algorithm has replaced the previous one in the manuscript, and all tests for the LB strategy have been repeated and the results updated. Since Alg. 4 represents only 5% of the algorithm time, its upgrade has yielded rather small improvements in the results. Even so, the important point is that the memory bottleneck has been eliminated.

3. Furthermore, we have corrected some minor errors and we have tried to improve the wording across all the manuscrip...