This work optimizes the hardware implementation of the SubBytes and inverse SubBytes operations in the Advanced Encryption Standard (AES). To this end, composite field arithmetic (CFA) is employed to optimize all building blocks in the S-box (and inverse S-box) of the SubBytes (and inverse SubBytes) transformation. A joint design of the S-box and inverse S-box is also proposed to further improve area efficiency. Specifically, the area of the multiplier in the composite Galois field $GF((2^2)^2)$ is reduced. The squaring and the multiplication by the constant λ in $GF((2^2)^2)$ are combined and optimized as well. Moreover, the multiplicative inversion in $GF((2^2)^2)$ is manually optimized. Furthermore, the S-box and inverse S-box are combined and optimized using the pre_processing and post_processing modules. To increase the throughput, a balanced and pipelined architecture is derived. With the proposed architecture, the S-box achieves a throughput of 5.79 Gbps on a Virtex-6 XC6VLX240T, which is 10% better than the conventional work. According to the ASIC implementation results, the proposed design also achieves the highest area efficiency, approximately 30% better than conventional works, using the TSMC 90 nm process.

INDEX TERMS Advanced encryption standard (AES), composite field arithmetic (CFA), S-box, VLSI architecture.
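As a rough illustration of the building blocks named above (the multiplier, the combined squaring and multiplication by λ, and the inversion in $GF((2^2)^2)$), the following Python sketch models composite field inversion in software. It is not the proposed hardware design. The field polynomials $x^2 + x + 1$, $y^2 + y + \varphi$ with $\varphi = \{10\}_2$, and $z^2 + z + \lambda$ with $\lambda = \{1100\}_2$ are assumptions chosen for the example, as are all function names. The basis conversion to and from the AES polynomial basis (the role of the pre_processing and post_processing modules) and the affine transformation are omitted.

```python
# Assumed constants (not taken from the source):
PHI = 0b10       # y^2 + y + PHI is irreducible over GF(2^2)
LAMBDA = 0b1100  # z^2 + z + LAMBDA is irreducible over GF((2^2)^2)


def mul2(a, b):
    """Multiply two elements of GF(2^2), reduced modulo x^2 + x + 1."""
    a1, a0 = a >> 1, a & 1
    b1, b0 = b >> 1, b & 1
    hi = (a1 & b1) ^ (a1 & b0) ^ (a0 & b1)
    lo = (a0 & b0) ^ (a1 & b1)
    return (hi << 1) | lo


def inv2(a):
    """Invert in GF(2^2): a^-1 equals a^2, and 0 maps to 0."""
    return mul2(a, a)


def mul4(a, b):
    """Multiply in GF((2^2)^2), reduced modulo y^2 + y + PHI."""
    ah, al = a >> 2, a & 3
    bh, bl = b >> 2, b & 3
    hh = mul2(ah, bh)
    hi = hh ^ mul2(ah, bl) ^ mul2(al, bh)
    lo = mul2(al, bl) ^ mul2(hh, PHI)
    return (hi << 2) | lo


def inv4(a):
    """Invert in GF((2^2)^2) using the norm delta, an element of GF(2^2)."""
    ah, al = a >> 2, a & 3
    delta = mul2(mul2(ah, ah), PHI) ^ mul2(ah, al) ^ mul2(al, al)
    d_inv = inv2(delta)
    return (mul2(ah, d_inv) << 2) | mul2(ah ^ al, d_inv)


def mul8(a, b):
    """Multiply in GF(((2^2)^2)^2), reduced modulo z^2 + z + LAMBDA."""
    ah, al = a >> 4, a & 15
    bh, bl = b >> 4, b & 15
    hh = mul4(ah, bh)
    hi = hh ^ mul4(ah, bl) ^ mul4(al, bh)
    lo = mul4(al, bl) ^ mul4(hh, LAMBDA)
    return (hi << 4) | lo


def inv8(a):
    """Invert an 8-bit element given in the composite-field basis."""
    ah, al = a >> 4, a & 15
    # Squaring followed by multiplication by LAMBDA: the step that a
    # hardware design can merge into a single block.
    sq_lambda = mul4(mul4(ah, ah), LAMBDA)
    delta = sq_lambda ^ mul4(ah, al) ^ mul4(al, al)
    d_inv = inv4(delta)
    return (mul4(ah, d_inv) << 4) | mul4(ah ^ al, d_inv)
```

In this model one 8-bit inversion reduces to a handful of 4-bit multiplications, one combined squaring and λ-scaling, and one 4-bit inversion, which is the decomposition that makes each block a separate optimization target. Inputs and outputs are in the composite-field representation, so the values do not match the AES S-box table directly.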