In many applications involving incompressible fluid flow, the Stokes system plays an important role. Complex flow problems may require extremely fine resolutions, easily resulting in saddle-point problems with more than a trillion (10 12 ) unknowns. Even on the most advanced supercomputers, the fast solution of such systems of equations is a highly nontrivial and challenging task. In this work we consider a realization of an iterative saddle-point solver which is based mathematically on the Schur-complement formulation of the pressure and algorithmically on the abstract concept of hierarchical hybrid grids. The design of our fast multigrid solver is guided by an innovative performance analysis for the computational kernels in combination with a quantification of the communication overhead. Excellent node performance and good scalability to almost a million parallel threads are demonstrated on different characteristic types of modern supercomputers.1. Introduction. Current leading edge supercomputers can provide performance in the order of several petaflop/s, enabling the development of increasingly complex and accurate computational models having unprecedented size. This is especially relevant in flow simulations that may exhibit many small scale features that must be resolved over large domains. As an example, the problem of earth mantle convection is posed on a thick spherical shell of approximately 3 000 km depth and 6 300 km radius, resulting in an overall volume of close to a trillion, that is, 10 12 km 3 . A high resolution then results automatically in huge algebraic systems.Although finite element (FE) methods are flexible enough to handle different local mesh-sizes, fully adaptive meshing techniques require dynamic data structures and a complex program control flow that incurs significant computational cost. Recent work on parallel adaptive FE techniques can be found, e.g., in [1,2,11,44]. In [10] it is shown that an adaptive parallel FE method can reach locally 1 km resolution for the mantle convection problem on a large scale supercomputer. Here we will demonstrate that such a resolution can even be reached globally.Higher order FE approaches can lead to a better accuracy with the same number of unknowns, but the linear systems are denser. This implies more computational work, more memory access cost, and also higher parallel communication cost, so