Clock networks dissipate a significant fraction of a chip's total power budget. Whereas most prior work addresses clock power optimization through clock routing or buffer sizing, we propose a novel register clustering methodology for reducing the power of clock trees. To validate the methodology, we also present a fast three-stage clock tree synthesis (CTS) approach based on register clustering. Compared with the state-of-the-art low-power CTS works Contango2.0 [21] and the CTS of Purdue University [16], our three-stage CTS approach consumes 1.30× and 1.07× less power and exhibits 2.01× and 1.52× smaller skew, respectively. Furthermore, its runtime is 17.36× and 8.16× shorter than that of [21] and [16], respectively.