Current high-performance computer systems used for scientific computing typically combine shared-memory computational nodes in a distributed-memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying programming and in exploiting the available hardware parallelism on shared-memory systems. In this paper we focus on how to extend task-parallel programming to distributed-memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of the hardware. We test the proposed programming model on two different applications: a Cholesky factorization and a solver for the Shallow Water Equations. We also compare the performance of our implementation with that of other frameworks for distributed task-parallel programming, and show that it is competitive.

arXiv:1801.03578v1 [cs.DC] 10 Jan 2018

…to the computational work performed by one node, the 1×9 process grid has the smallest variance between nodes, and therefore also the lowest maximum work size. The 9×1 process grid leads to a smaller maximum work size than the 3×3 process grid if B is large enough, but suffers from significant load imbalance in the case B = 18. In all cases, the work becomes more evenly distributed as the number of level 1 tasks B increases. The statistics for communication and computation point in different directions, but when comparing with actual run times, we have found that the communication size is the most informative measure. A large total communication size is likely to be detrimental to performance, as the risk of tasks being left waiting for remote data increases, as does the risk of message congestion. Of the factors considered, using a square process grid has the largest impact.
Regarding the block sizes, a large B improves the load balance, but increases both the total amount of communication and the number of messages (another indicator, not shown in the graphics).
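To make the load-balance comparison concrete, the following sketch (not the paper's actual code) distributes the lower triangle of a B×B block grid block-cyclically over a Pr×Pc process grid and reports per-process block counts. This is only a rough proxy for the work statistics discussed above: a faithful comparison would weight each block by its actual task cost, and the paper's hierarchical decomposition may assign work differently.

```python
# Hypothetical illustration: count lower-triangular blocks owned by each
# process under a 2D block-cyclic mapping, for the process grids and the
# block count B = 18 mentioned in the text.
from collections import Counter

def block_counts(B, Pr, Pc):
    """Number of blocks owned by each process in a Pr x Pc grid."""
    counts = Counter()
    for i in range(B):            # block row
        for j in range(i + 1):    # lower triangle: j <= i
            counts[(i % Pr, j % Pc)] += 1
    return counts

def summary(B, Pr, Pc):
    """(min, max, total) blocks per process -- a crude imbalance measure."""
    vals = list(block_counts(B, Pr, Pc).values())
    return min(vals), max(vals), sum(vals)

for Pr, Pc in [(1, 9), (3, 3), (9, 1)]:
    lo, hi, total = summary(18, Pr, Pc)
    print(f"{Pr}x{Pc} grid: min={lo} max={hi} total={total}")
```

With unit block weights the 1×9 and 9×1 grids give mirror-image distributions, so the differences reported in the text must come from the actual per-task costs; the sketch is meant only to show how the grid shape changes which node owns which blocks.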