A fine-grained block incomplete LU (FGBILU) factorization for solving large-scale block-sparse linear systems resulting from coupled PDE systems with n equations has been recently developed for massively parallel heterogeneous architectures, such as generalpurpose graphics processing units (GPGPUs). A straightforward one-sweep wavefront ordering is combined with element-wise block submatrix operations, allowing FGBILU to achieve low-overhead concurrent computation at O n 2 N 2 scale on a 3D PDE domain with a linear scale of N . Numerical experiments show that FGBILU is less efficient on smaller domains. Besides the inevitable performance penalty of a wavefront ordering, the index reconstruction by each concurrent computation thread causes considerable parallelism overhead. One way to reduce the overhead is to employ thread recycling along with CUDA inter-block synchronization. Dynamic parallelism is also attempted, although with no significant perforamnce benefit. The improved FGBILU is tested for a series of 3D PDE domains extracted from an incompressible Navier-Stokes solver called INCOMP3D. Results show that thread recycling can significantly reduce parallelism overhead and improve the performance of FGBILU on smaller domains.