The improved coding efficiency and visual quality offered by the High Efficiency Video Coding (HEVC) standard are attained at the cost of significantly higher computational complexity in both the encoder and the decoder. In particular, the large number of intra prediction modes considered by this standard, together with the increased complexity of the adopted block coding tree structures and their wider diversity of transforms, imposes computational demands that current general-purpose processors can hardly satisfy under hard real-time requirements. Furthermore, the strict data dependencies that are imposed make parallelization difficult and rarely efficient with conventional approaches. To overcome this difficulty, this paper exploits Graphics Processing Units (GPUs) to accelerate the intra decoding procedure in HEVC, encompassing the most demanding modules of the decoder (i.e., de-quantization, inverse transform, intra prediction, deblocking filter, and sample adaptive offset). The presented approaches exploit both coarse- and fine-grained parallelization opportunities in an integrated manner by redesigning the execution pattern of the involved modules, while coping with their inherent computational complexity and strict data dependencies. As a result, the proposed parallelization, which is fully compliant with the HEVC standard, is shown to be a viable approach, capable of satisfying hard real-time requirements by processing each Ultra HD 4K intra frame in less than 25 ms (about 40 fps).