Scientific applications are often complex, irregular, and computationally-intensive. To accommodate the ever-increasing computational demands of scientific applications, high performance computing (HPC) systems have become larger and more complex, offering parallelism at multiple levels (e.g., nodes, cores per node, threads per core). Scientific applications need to exploit all the available multilevel hardware parallelism to harness the available computational power. The performance of applications executing on such HPC systems may adversely be affected by load imbalance at multiple levels, caused by problem, algorithmic, and systemic characteristics. Nevertheless, most existing load balancing methods do not simultaneously address load imbalance at multiple levels. This work investigates the impact of load imbalance on the performance of three scientific applications at the thread and process levels. We jointly apply and evaluate selected dynamic loop self-scheduling (DLS) techniques to both levels. Specifically, we employ the extended LaPeSD OpenMP runtime library [1] at the thread level, and extend the DLS4LB MPI-based dynamic load balancing library [2] at the process level. This approach is generic and applicable to any multiprocess-multithreaded computationally-intensive application (programmed using MPI and OpenMP). We conduct an exhaustive set of experiments to assess and compare six DLS techniques at the thread level and eleven at the process level. The results show that improved application performance, by up to 21%, can only be achieved by jointly addressing load imbalance at the two levels. We offer insights into the performance of the selected DLS techniques and discuss the interplay of load balancing at the thread level and process level.