As the number of pixels per frame tends to increase in new high definition video coding standards such as HEVC and VP9, pel decimation appears as a viable means of increasing the energy efficiency of Sum of Absolute Differences (SAD) calculation. First, we analyze the quality costs of pel decimation using a video coding software. Then we present and evaluate two VLSI architectures to compute the SAD of 4x4 pixel blocks: one that can be configured with 1:1, 2:1 or 4:1 sampling ratios and a non-configurable one, to serve as baseline in comparisons. The architectures were synthesized for 90nm, 65nm and 45nm standard cell libraries assuming both nominal and Low-Vdd/High-Vt (LH) cases for maximum and for a given target throughput. The impacts of both subsampling and LH on delay, power and energy efficiency are analyzed. In a total of 24 syntheses, the 45nm/LH configurable SAD architecture synthesis achieved the highest energy efficiency for target throughput when operating in pel decimation 4:1, spending only 2.05pJ for each 4×4 block. This corresponds to about 13.65 times less energy than the 90nm/nominal configurable architecture operating in full sampling mode and maximum throughput and about 14.77 times less than the 90nm/nominal non-configurable synthesis for target throughput. Aside the improvements achieved by using LH, pel decimation solely was responsible for energy reductions of 40% and 60% when choosing 2:1 and 4:1 subsampling ratios, respectively, in the configurable architecture. Finally, it is shown that the configurable architecture is more energy-efficient than the non-configurable one.