Future lithography systems must produce chips with smaller feature sizes, while maintaining throughput comparable to today's optical lithography systems. This places stringent data handling requirements on the design of any direct-write maskless system. To achieve the throughput of one wafer layer per minute with a direct-write maskless lithography system, using 22 nm pixels for 45 nm technology, a data rate of 12 Tb/s is required. In recent years, we have developed a datapath architecture for direct-write lithography systems, and have shown that lossless compression plays a key role in reducing throughput requirements of such systems. Our approach integrates a low complexity hardware-based decoder with the writers, in order to decode a compressed data layer in real time on the fly. In doing so, we have developed a spectrum of lossless compression algorithms for integrated circuit rasterized layout data to provide a tradeoff between compression efficiency and hardware complexity, the most promising of which is Block Golomb Context Copy Coding (Block GC3). In this paper, we present the FPGA synthesis and emulation results for the Block GC3 decoder. For one Block GC3 decoder, 3233 slice flip-flops and 3086 4-input LUTs are utilized in a Xilinx Virtex II Pro 70 FPGA, which corresponds to 4% of its resources, along with 1.7 KB of internal memory. The system runs at 100 MHz clock rate, with the overall output rate of 495 Mb/s for a single decoder. In addition to the decoder implementation results, we discuss other hardware implementation issues for the writer system data path, including on-chip input/output buffering, error propagation control, and input data stream packaging. This hardware data path implementation is independent of the writer systems or data link types, and can be integrated with arbitrary direct-write lithography systems.