This paper presents a new algorithm for on-the-fly data compression in high-performance VLIW processors. The algorithm aggressively targets energy minimization of some of the dominant factors in the SoC energy budget, namely main memory accesses and the high-throughput global bus. Both the algorithm, which is based on a differential technique, and the hardware compression unit have been developed to manage data compression and decompression efficiently, under strict real-time constraints, within a high-performance industrial processor architecture (Lx-ST200, a 4-issue, 6-stage pipelined VLIW processor with on-chip data and instruction caches). Each data-cache line is compressed before it is written back to main memory and decompressed whenever a cache refill takes place. An extensive experimental strategy has been developed specifically to validate the technique on the target Lx processor. To allow public comparison, we also report the results obtained on a MIPS pipelined RISC processor simulated with SimpleScalar. The two platforms have been benchmarked over Ptolemy and MediaBench programs. Energy savings provided by the proposed technique range from 10% to 22% on the Lx-ST200 platform and from 11% to 14% on the MIPS platform.
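To illustrate the general idea of differential compression applied to a cache line, the following is a minimal sketch and not the encoding used by the proposed hardware unit. It assumes 32-bit words, takes the first word of the line as a base, and stores every other word as a narrow signed difference from that base; a line that does not fit is left uncompressed. The names `compress_line`, `decompress_line`, `compressed_bits`, and the parameter `delta_bits` are hypothetical.

```python
from typing import List, Optional, Tuple

WORD_BITS = 32                      # assumption: 32-bit data words
WORD_MASK = (1 << WORD_BITS) - 1


def compress_line(words: List[int], delta_bits: int = 8) -> Optional[Tuple[int, List[int]]]:
    """Return (base, deltas) if every word lies within a signed
    delta_bits range of the first word; otherwise return None,
    meaning the line would be written back uncompressed."""
    base = words[0]
    lo = -(1 << (delta_bits - 1))
    hi = (1 << (delta_bits - 1)) - 1
    deltas = []
    for w in words[1:]:
        d = w - base
        if not lo <= d <= hi:
            return None
        deltas.append(d)
    return base, deltas


def decompress_line(base: int, deltas: List[int]) -> List[int]:
    """Rebuild the original words on a cache refill."""
    return [base] + [(base + d) & WORD_MASK for d in deltas]


def compressed_bits(deltas: List[int], delta_bits: int = 8) -> int:
    """Size of the compressed line: one full base word plus the narrow deltas."""
    return WORD_BITS + delta_bits * len(deltas)


line = [0x1000, 0x1004, 0x1008, 0x0FFC]
packed = compress_line(line)
if packed is not None:
    base, deltas = packed
    restored = decompress_line(base, deltas)
```

In this sketch a four-word line with nearby values shrinks from 128 bits to 56 bits, which is the kind of reduction in memory and bus traffic that a differential scheme aims for; a real unit would also need a per-line flag marking whether the line is stored compressed.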