Guided image filtering has been widely applied to meet the increasing demand for high-performance filtering, especially in real-time image and video processing. The gradient guided filter improves filtering quality by reducing halo artifacts, owing to its edge-aware characteristics. However, the gradient guided filter algorithm has high computational complexity, and its computation involves global pixels, both of which hinder its VLSI implementation for real-time full-HD applications. This work addresses these issues and proposes a VLSI architecture for the gradient guided image filter. Several design techniques are developed and applied to achieve high computation speed and high throughput. A seamless dataflow is proposed for the complete system, which consists of three main processing stages: the preprocessing stage, the linear coefficient computation stage and the output stage.

The preprocessing stage applies down-sampling with a large sampling rate to reduce the computation cost in terms of circuit size, processing time and power consumption. The global parameter values are derived quickly while reasonably good global information is maintained, so that the quality of the filtering results is not sacrificed when these values are applied in the subsequent two stages.

The linear coefficient computation stage contains the most complex computations, such as square root, division and the exponential function. Down-sampling is applied with a sampling rate lower than that of the preprocessing stage to balance computation cost against filtering accuracy. In addition, the intensive arithmetic modules that dominate the critical path delay are redesigned using suitable approximate operations. Specifically, novel non-iterative dividers are developed to replace the original dividers and reduce the delay in the critical paths. With the proposed non-iterative division, the quotient of the division is modeled as a normalized curved surface.
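To make the surface view of division concrete, the following is a minimal software sketch, not the hardware design of this work. It assumes both operands are normalized to [1, 2), splits that domain into a hypothetical 8 x 8 grid of regions, and fits one plane per region by ordinary least squares. The names `fit_plane`, `approx_div` and `max_error`, the grid size and the sample count are all illustrative assumptions.

```python
# Sketch: approximate q = x / y with one plane per region of a normalized domain.
# Assumptions (not from the source): operands lie in [1, 2), the domain is
# split into 8 x 8 regions, and each plane is a plain least-squares fit.

REGIONS = 8          # hypothetical number of partitions per axis
SAMPLES = 9          # fitting samples per axis inside each region
LO, HI = 1.0, 2.0    # assumed normalized operand range
STEP = (HI - LO) / REGIONS


def fit_plane(x0, y0):
    """Least-squares plane q ~ a + b*u + c*v around the centre of one region."""
    xc, yc = x0 + STEP / 2, y0 + STEP / 2
    # Offsets are symmetric about the centre, so the sums of u, v and u*v
    # vanish and the normal equations decouple into three independent ratios.
    offsets = [STEP * (i / (SAMPLES - 1) - 0.5) for i in range(SAMPLES)]
    sum_q = sum_qu = sum_qv = 0.0
    for u in offsets:
        for v in offsets:
            q = (xc + u) / (yc + v)
            sum_q += q
            sum_qu += q * u
            sum_qv += q * v
    sum_uu = SAMPLES * sum(u * u for u in offsets)  # same value for v
    return xc, yc, sum_q / SAMPLES**2, sum_qu / sum_uu, sum_qv / sum_uu


# Look-up table of plane coefficients, one entry per region.
LUT = [[fit_plane(LO + i * STEP, LO + j * STEP) for j in range(REGIONS)]
       for i in range(REGIONS)]


def approx_div(x, y):
    """Non-iterative estimate of x / y: one table read, two multiplies, two adds."""
    i = min(int((x - LO) / STEP), REGIONS - 1)
    j = min(int((y - LO) / STEP), REGIONS - 1)
    xc, yc, a, b, c = LUT[i][j]
    return a + b * (x - xc) + c * (y - yc)


def max_error(n=64):
    """Largest absolute error of approx_div over an n x n grid in [1, 2)."""
    worst = 0.0
    for kx in range(n):
        for ky in range(n):
            x = LO + (HI - LO) * kx / n
            y = LO + (HI - LO) * ky / n
            worst = max(worst, abs(approx_div(x, y) - x / y))
    return worst
```

Calling `approx_div(1.5, 1.25)` returns a value close to 1.2 without any iteration. A finer partition shrinks the error at the cost of a larger table, which is the trade-off a per-region error optimization would tune.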
The curved surface is partitioned into small regions, each approximated by a smaller plane for efficient hardware implementation. A curve fitting method and a mixed integer linear programming method are adopted and evaluated for local optimization of the approximation errors. In this way, the dividers are implemented with only simple arithmetic operations and a small look-up table. As a result, the operation is fast, and the approximation errors are optimized while the accuracy requirement is satisfied. The other intensive computation, the exponential function, is also simplified by piecewise linear approximation and implemented with only shifters and adder trees. These approximate computations thus improve the computation performance and simplify the design of the complex arithmetic modules.

The output stage employs parallel processing and operates at a frequency 16 times higher to restore the filtering results to the original full frame size. The linear coefficients obtained from the down-sampling stage are applied concurrently to all the 16 pixels i...